Abstract

The Ames Housing dataset records residential property sales in Ames, Iowa, between 2006 and 2010. The training file analysed here contains 1,460 entries and 81 columns, covering zoning classification, lot dimensions, street access, and utilities, as well as the quality and condition of each house, its year of construction, amenities, and other physical and functional attributes. Because it spans everything from structural characteristics to neighbourhood features, the dataset supports detailed real estate market studies, statistical investigation, and research on price prediction and housing market trends. It is a useful resource for real estate developers, economic forecasters, and academics working in urban planning and property valuation.

Problem Identification for Ames Housing Data Analysis

Background Research on the Ames Housing Dataset

The Ames Housing data is one of the few alternatives to the Boston Housing data, which is commonly used for teaching regression analysis. It is a rich dataset that offers detailed insight into the real estate market in Ames, Iowa, from 2006 to 2010. The tabular data used here has 1,460 observations and 79 explanatory variables describing various facets of residential properties (81 columns in total once the Id and SalePrice columns are counted). These variables range from construction materials to the condition of individual components of a house and its surroundings, and the goal is to predict the selling price of each home.

Points of Domain Expertise

To effectively analyze the Ames Housing dataset and derive meaningful insights, expertise in several key domains is crucial:

  1. Real Estate Market Trends: Understanding both general and local real estate market trends in Ames, Iowa, including average pricing, popular buying areas, buyer preferences, and seasonal buying patterns.

  2. Construction and Housing Features: Familiarity with various construction elements such as foundation types, roofing materials, and exterior features, and how these affect property durability and pricing.

  3. Seasonal Impact on Sales: Awareness of seasonal variations in real estate transactions in the U.S., including the busiest seasons (spring and summer) with increased demand and peak prices, as well as the relatively slower seasons (fall and winter) with softer prices and longer time on the market.

  4. Zoning and Regulatory Compliance: Knowledge of local land-use regulations and zoning laws that can influence real estate development projects and property values.

  5. Economic Indicators: Understanding of local economic conditions affecting the housing market, including employment rates, average incomes, and measures of economic growth.

Understanding the Variables

The determination of real estate value in the Ames Housing dataset relies on several key variables:

  1. Physical Features: Variables like lot area, overall quality, overall condition, and year built directly influence property valuation based on the quality of materials, finish, and age of construction.

  2. Location: The variable Neighborhood categorizes houses into various parts of Ames, impacting price due to location desirability and local amenities.

  3. Size and Space: Features such as above-ground living area square footage (GrLivArea) and total basement square footage (TotalBsmtSF) are crucial indicators of property size and space.

  4. Amenities: Features such as the number of fireplaces (Fireplaces), garage capacity (GarageCars), and pool quality (PoolQC) can add to property value by improving amenities and lifestyle.

  5. Renovations and Upgrades: The remodel date (YearRemodAdd) indicates recent changes or improvements that could materially affect the sale price.
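
As a minimal sketch of how these variables could be turned into model inputs, the snippet below derives a remodel flag, a combined living area, and the age at sale. The six rows are copied from the head() output shown in the exploration section; the derived column names (Remodeled, TotalSF, AgeAtSale) are hypothetical and not part of the dataset. The remodel flag assumes that YearRemodAdd equals YearBuilt when no remodelling took place.

```r
# First six rows, copied from the head() output in the exploration section.
ames_head <- data.frame(
  YearBuilt    = c(2003, 1976, 2001, 1915, 2000, 1993),
  YearRemodAdd = c(2003, 1976, 2002, 1970, 2000, 1995),
  GrLivArea    = c(1710, 1262, 1786, 1717, 2198, 1362),
  TotalBsmtSF  = c(856, 1262, 920, 756, 1145, 796),
  YrSold       = c(2008, 2007, 2008, 2006, 2008, 2009)
)

# Hypothetical derived columns (illustrative names, not dataset variables).
# Assumption: a remodel year equal to the build year means "never remodelled".
ames_head$Remodeled <- ames_head$YearRemodAdd != ames_head$YearBuilt
ames_head$TotalSF   <- ames_head$GrLivArea + ames_head$TotalBsmtSF
ames_head$AgeAtSale <- ames_head$YrSold - ames_head$YearBuilt

ames_head[, c("Remodeled", "TotalSF", "AgeAtSale")]
```

Derived columns like these keep the raw variables untouched, so they can be dropped again if they turn out not to help the model.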

Proposed Data Science Questions from the Dataset

Based on the above domain knowledge and dataset understanding, several analytical questions can be formulated:

  1. How do external features such as proximity and lot area influence the sale price of homes in Ames?

  2. What effects do renovations have on the sale price of a house?

  3. How do energy efficiency and utilities affect the sale price of a house?

  4. What is the impact of landscape and outdoor features on the sale price of a house?

  5. How do neighborhood amenities affect the sale price of a house?

  6. How do market dynamics influence the sale price of a house?

  7. How do seasonal trends affect the sale price of houses in Ames?

  8. How do the quality and condition of a house affect its sale price in Ames?

  9. What is the relationship between having a garage and the sale price of houses in Ames?

Addressing these questions through detailed data analysis will allow for effective modeling and prediction of housing prices, providing valuable insights for potential buyers, sellers, and real estate professionals in Ames, Iowa.

Data Preprocessing for Ames Housing Data Analysis

The preprocessing of data for the Ames Housing dataset is crucial for accurate analysis. This involves cleaning the data and addressing missing values to ensure the integrity of predictive modeling outcomes. Anomalies in the data could significantly impact the results, making thorough preprocessing essential for reliable analysis.

Loading Dataset and Performing Initial Exploration

The files “train.csv” and “test.csv” are imported into R with the read.csv function. The training data is stored in the variable “ameshous_train_data” and the test data in “ameshous_test_data”; the exploration below uses the training data.

ameshous_train_data <- read.csv("datasets/train.csv")
ameshous_test_data <- read.csv("datasets/test.csv")
summary(ameshous_train_data)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 
str(ameshous_train_data)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
head(ameshous_train_data)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
dim(ameshous_train_data)
## [1] 1460   81
colnames(ameshous_train_data)
##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
## [45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
## [81] "SalePrice"
# Open the data in the interactive viewer (base R spells this View)
View(ameshous_train_data)
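
Two details of the output above follow from read.csv defaults: column names such as X1stFlrSF carry an "X" prefix, and fields containing the text NA are read as missing values. The sketch below illustrates both on a two-row inline sample copied from the head() output. The raw header names "1stFlrSF" and "2ndFlrSF" are an assumption inferred from the X-prefixed names; the actual file is not reproduced here.

```r
# Two rows copied from the head() output; the un-prefixed header names
# are an assumption about how the raw CSV spells them.
csv_text <- "Id,Alley,1stFlrSF,2ndFlrSF
1,NA,856,854
2,NA,1262,0"

# Default behaviour: names are made syntactically valid (check.names = TRUE)
# and the string "NA" is converted to a missing value (na.strings = "NA").
default_read <- read.csv(text = csv_text)
names(default_read)
is.na(default_read$Alley)

# Alternative: keep the raw names and treat "NA" as an ordinary string.
raw_read <- read.csv(text = csv_text, check.names = FALSE, na.strings = "")
names(raw_read)
raw_read$Alley
```

Keeping "NA" as a string can be useful when it encodes a real category rather than an unknown value, although whether that applies to a given column should be checked against the dataset's documentation.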

Identifying Missing Values in the Loaded Dataset

The code uses dplyr and tidyr to summarise missing values in “ameshous_train_data”. It counts the NA entries in each column, reshapes the result into a long format with one row per variable, keeps only the variables with at least one missing value, and sorts them in descending order of missing count. The output shows that PoolQC, MiscFeature, Alley, and Fence have the most missing values.

library(dplyr)   # summarise(), across(), filter(), arrange()
library(tidyr)   # pivot_longer()

# Count NAs per column, reshape to one row per variable, keep only the
# variables with at least one missing value, and sort by count.
missing_values <- ameshous_train_data %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "MissingCount") %>%
  filter(MissingCount > 0) %>%
  arrange(desc(MissingCount)) %>%
  as.data.frame()

# Copy with the same two columns, used by the plotting code
missing_data_df <- data.frame(Variable = missing_values$Variable, MissingCount = missing_values$MissingCount)

print(missing_values)
##        Variable MissingCount
## 1        PoolQC         1453
## 2   MiscFeature         1406
## 3         Alley         1369
## 4         Fence         1179
## 5   FireplaceQu          690
## 6   LotFrontage          259
## 7    GarageType           81
## 8   GarageYrBlt           81
## 9  GarageFinish           81
## 10   GarageQual           81
## 11   GarageCond           81
## 12 BsmtExposure           38
## 13 BsmtFinType2           38
## 14     BsmtQual           37
## 15     BsmtCond           37
## 16 BsmtFinType1           37
## 17   MasVnrType            8
## 18   MasVnrArea            8
## 19   Electrical            1
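
Raw counts are easier to compare when expressed as a share of all rows; for instance, 1,453 missing PoolQC values out of 1,460 rows is about 99.5%. The following is a minimal base R sketch of that calculation. The data frame `toy` and its values are hypothetical stand-ins for “ameshous_train_data”, and the column name `MissingPct` is an assumption introduced for illustration.

```r
# Toy stand-in for ameshous_train_data (hypothetical values, 5 rows)
toy <- data.frame(
  PoolQC      = c(NA, NA, NA, "Gd", NA),
  LotFrontage = c(65, NA, 80, 60, NA),
  LotArea     = c(8450, 9600, 11250, 9550, 14260)
)

# Count NAs per column, then express them as a percentage of all rows
na_count <- colSums(is.na(toy))
na_share <- data.frame(
  Variable     = names(na_count),
  MissingCount = as.integer(na_count),
  MissingPct   = round(100 * na_count / nrow(toy), 1),
  row.names    = NULL
)

# Keep only columns with at least one NA, most incomplete first
na_share <- na_share[na_share$MissingCount > 0, ]
na_share <- na_share[order(-na_share$MissingCount), ]
print(na_share)
```

A percentage view like this can help decide whether a variable is worth imputing at all or is better treated as an "absent feature" indicator.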

Visualisation of Missing Values in the Dataset

This R snippet uses ggplot2 to draw a bar chart of the missing-value counts in the “missing_data_df” dataframe. The x-axis is ordered by descending missing count, the bars are solid blue with a white border, and geom_text labels each bar with its count. Because fill is set to a constant rather than mapped to MissingCount, the scale_fill_gradient() call has no visible effect.

ggplot(missing_data_df, aes(x = reorder(Variable, -MissingCount), y = MissingCount)) +
  geom_bar(stat = "identity", fill = "blue", color = "white") +
  geom_text(aes(label = MissingCount), vjust = -0.3, color = "black", size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "blue", name = "Missing Count") +
  labs(
    title = "Missing Data Counts by Variable",
    subtitle = "Total counts of missing entries for each variable in the dataset",
    x = "Variable",
    y = "Number of Missing Values"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12),
    axis.title = element_text(size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12, color = "gray50"),
    axis.text.y = element_text(size = 12, color = "gray50"),
    legend.position = "none",
    plot.margin = unit(c(10, 10, 10, 10), "pt")
  )

Treating the Missing Values

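The R code below treats the missing values and saves the cleaned data as “amesclean_train_data.csv”. Categorical features in which NA means the feature is absent are filled with "None", MasVnrArea is set to 0, GarageYrBlt falls back to YearBuilt, and Electrical is filled with its most frequent value.

# Features whose NA values are replaced with the label "None".
# Caution: LotFrontage is numeric, so assigning "None" coerces the whole column
# to character and drops it from the later numeric summaries and correlations.
features_none = c("Alley", "MasVnrType", "BsmtQual", "BsmtCond", "BsmtExposure",
                  "BsmtFinType1", "BsmtFinType2", "FireplaceQu", "GarageType", 
                  "GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature","LotFrontage")

for (feature in features_none) {
  ameshous_train_data[[feature]][is.na(ameshous_train_data[[feature]])] <- "None"
}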

ameshous_train_data$MasVnrArea[is.na(ameshous_train_data$MasVnrArea)] <- 0

ameshous_train_data$GarageYrBlt[is.na(ameshous_train_data$GarageYrBlt)] <- ameshous_train_data$YearBuilt[is.na(ameshous_train_data$GarageYrBlt)]

mode_electrical <- names(which.max(table(ameshous_train_data$Electrical)))
ameshous_train_data$Electrical[is.na(ameshous_train_data$Electrical)] <- mode_electrical

missing_values_summary <- sapply(ameshous_train_data, function(x) sum(is.na(x)))
missing_columns <- names(missing_values_summary)[missing_values_summary > 0]
missing_values_df <- ameshous_train_data[, missing_columns]
print(missing_values_summary)
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0             0             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0             0             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             0             0             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##             0             0             0             0             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##             0             0             0             0             0 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             0             0             0             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0             0             0             0 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##             0             0             0             0             0 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0             0             0             0 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0
write.csv(ameshous_train_data, "datasets/amesclean_train_data.csv", row.names = FALSE)
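
LotFrontage is a numeric measurement, and filling it with the string "None" turns the column into text, which is consistent with its absence from the numeric correlation output later in this report. One possible alternative, sketched here on hypothetical toy data, is to fill missing values with the median LotFrontage of the same neighborhood. This is an illustration under stated assumptions (the data frame `toy` and its values are made up), not the method used for the results shown in this report.

```r
# Toy stand-in for the training data (hypothetical values)
toy <- data.frame(
  Neighborhood = c("NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"),
  LotFrontage  = c(60, NA, 80, 70, NA)
)

# Median LotFrontage within each neighborhood, ignoring NAs;
# ave() returns one value per row, aligned with the original data
nbhd_median <- ave(toy$LotFrontage, toy$Neighborhood,
                   FUN = function(x) median(x, na.rm = TRUE))

# Fill only the missing entries, so the column stays numeric
missing <- is.na(toy$LotFrontage)
toy$LotFrontage[missing] <- nbhd_median[missing]
print(toy)
```

A neighborhood in which every LotFrontage is missing would still yield NA under this approach, so a global median fallback may be needed in practice.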

Understanding Dataset

This R script visualizes the correlation matrix of the numerical features using ggplot2 and reshape2. It reads the cleaned file “amesclean_train_data.csv” into “ames_housing”, extracts the numeric columns, calculates the pairwise correlation matrix, reshapes it into long format with melt, and plots it as a tile plot whose color indicates the strength and direction of each correlation. Text labels show the numeric correlation values, and minimal styling keeps the plot readable.

ames_housing <- read.csv("datasets/amesclean_train_data.csv")
ames_numeric <- ames_housing[sapply(ames_housing, is.numeric)]

cor_matrix <- cor(ames_numeric, use = "pairwise.complete.obs")

cor_melted <- melt(cor_matrix)

ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.2) +
  geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3, vjust = 1) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, limit = c(-1, 1), name="Correlation") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    axis.text.y = element_text(size = 8),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "gray95"),
    plot.background = element_rect(color = "gray95", fill = "gray95")
  ) +
  labs(title = "Correlation Matrix of Housing Features", subtitle = "Numeric features of the Ames Housing dataset")

The next step is to identify the 10 variables most positively correlated with the target variable “SalePrice”. “OverallQual” stands out as one of the variables most strongly correlated with “SalePrice”.

sale_price_correlations <- cor_matrix[,"SalePrice", drop = FALSE]
sorted_correlations <- sort(sale_price_correlations[,1], decreasing = TRUE)

top_correlations <- head(sorted_correlations[-1], 10)

cor_data <- data.frame(
  Variable = names(top_correlations),
  Correlation = top_correlations
)

cor_melted <- melt(cor_data, id.vars = "Variable")

# Single-row heatmap: one tile per variable, all on a constant "SalePrice" row
ggplot(cor_data, aes(x = Variable, y = "SalePrice", fill = Correlation)) +
  # linewidth replaces the size argument deprecated for lines in ggplot2 3.4.0
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f", Correlation)), color = "black", size = 5, vjust = 0.5) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name="Correlation") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    plot.background = element_rect(fill = "white")
  ) +
  labs(title = "Top 10 Variables Correlated with SalePrice", x = "Variables", y = "")
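
The ranking above sorts signed correlations, so a strongly negative predictor can never reach the top 10, and dropping the first element assumes that SalePrice itself sorts first. A small sketch of an alternative, ranking by absolute correlation and removing the target by name, is shown below. The data frame `toy` and its values are hypothetical; only the column names are borrowed from the dataset.

```r
# Toy stand-in with one strong positive, one strong negative and one weak predictor
toy <- data.frame(
  SalePrice     = c(100, 150, 200, 250, 300, 350),
  OverallQual   = c(4, 5, 6, 7, 8, 9),
  EnclosedPorch = c(90, 80, 40, 50, 10, 0),
  MoSold        = c(6, 2, 9, 2, 9, 5)
)

# Correlation of every column with the target
cors <- cor(toy)[, "SalePrice"]

# Drop the self-correlation by name instead of by position
cors <- cors[names(cors) != "SalePrice"]

# Rank by strength regardless of sign, keeping the signed values
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

Keeping the signed values in the ranked vector preserves the direction of each relationship for plotting.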

Distribution of Sale Price by Frequency (No. of Houses)

The histogram with a density plot overlay illustrates the distribution of sale prices in the Ames housing dataset. Values are concentrated in the lower price range, with a long tail extending towards higher values, so only a few houses sell at high prices. The mean sale price is approximately $180,921. Because the distribution is right-skewed, the small number of very expensive houses pulls the mean upward, which makes it a less representative measure of a typical price than the median or mode.

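# Density is rescaled to the histogram's count axis: count * bin width (not * number of bins)
bin_width <- diff(range(ames_housing$SalePrice)) / 30
ggplot(ames_housing, aes(x = SalePrice)) +
  geom_histogram(binwidth = bin_width, fill = "blue", color = "black", alpha = 0.7) +
  geom_density(aes(y = after_stat(count * bin_width), color = "Density"), fill = "lightblue", alpha = 0.3) +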
  labs(title = "Distribution of Sale Prices", x = "Sale Price", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "bottom"
  ) +
  scale_color_manual(values = c("Density" = "red")) +
  guides(color = guide_legend(title = "Overlay"))

mean_sale_price <- mean(ames_housing$SalePrice)
print(mean_sale_price)
## [1] 180921.2
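
Right skew can be checked numerically by comparing the mean with the median and by computing a skewness coefficient. The sketch below does this on a hypothetical vector of prices (`toy_price` is made up for illustration and `skewness` is a helper defined here, not a function from a loaded package), and shows how a log transform reduces the skew.

```r
# Hypothetical sale prices with one very expensive house
toy_price <- c(90000, 120000, 135000, 150000, 163000,
               180000, 210000, 260000, 755000)

mean_price   <- mean(toy_price)
median_price <- median(toy_price)

# Moment-based skewness: third central moment divided by variance^1.5
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(toy_price)        # positive for a right-skewed sample
skew_log <- skewness(log(toy_price))   # closer to zero after the log transform

print(c(mean = mean_price, median = median_price))
print(round(c(raw = skew_raw, log = skew_log), 2))
```

If SalePrice behaves similarly, modelling log(SalePrice) rather than the raw price is a common option worth considering.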

Distribution of Sale Price, showing outliers by using Boxplot

The boxplot with jittered data points displays the distribution of sale prices in the Ames housing dataset. It showcases the interquartile range, median line, and outliers, highlighting homes with significantly higher prices. Jittered points visually represent data density across price segments, discerning variability in housing prices and identifying outliers potentially due to unique features or desirable locations.

ggplot(ames_housing, aes(y = SalePrice)) +
  geom_boxplot(fill = "lightblue", color = "black", alpha = 0.7, outlier.shape = 16, outlier.size = 2) +
  geom_jitter(aes(x = 1), color = "blue", alpha = 0.3, width = 0.1) +
  labs(y = "Sale Price") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.y = element_text(size = 14)
  ) +
  ggtitle("Distribution of Sale Price")
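
The points drawn as outliers in a boxplot follow the conventional rule of lying more than 1.5 times the interquartile range beyond the quartiles. As a rough sketch, the same rule can be applied directly to list the flagged prices. The vector `toy_price` below is hypothetical.

```r
# Hypothetical sale prices with one very expensive house
toy_price <- c(90000, 120000, 135000, 150000, 163000,
               180000, 210000, 260000, 755000)

# First and third quartiles and the interquartile range
q   <- unname(quantile(toy_price, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences of the 1.5 * IQR rule
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- toy_price[toy_price < lower_fence | toy_price > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

Listing the flagged rows, rather than only seeing them as dots, makes it possible to inspect which features those homes have.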

Correlation Matrix for Physical Features of House by Overall Quality, Overall Cond, and Year Built

ames_housing %>%
  select(OverallQual, OverallCond, YearBuilt, RoofStyle, Exterior1st, Exterior2nd) %>%
  summary()
##   OverallQual      OverallCond      YearBuilt     RoofStyle        
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Length:1460       
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   Class :character  
##  Median : 6.000   Median :5.000   Median :1973   Mode  :character  
##  Mean   : 6.099   Mean   :5.575   Mean   :1971                     
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000                     
##  Max.   :10.000   Max.   :9.000   Max.   :2010                     
##  Exterior1st        Exterior2nd       
##  Length:1460        Length:1460       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
ggplot(ames_housing, aes(x = OverallQual)) +
  geom_histogram(binwidth = 1, fill = "blue") +
  labs(title = "Distribution of Overall Quality Ratings")

physical_features <- ames_housing %>% 
  select(OverallQual, OverallCond, YearBuilt)
cor_physical <- cor(physical_features, use = "complete.obs")
ggplot(melt(cor_physical), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(midpoint = 0, low = "blue", high = "red", mid = "white") +
  theme_minimal() +
  labs(title = "Correlation Matrix for Physical House Features")

Exploratory Data Analysis (EDA)

  1. How do external features such as proximity and lot area influence the sale price in Ames, Iowa?

    In Ames, Iowa’s housing market, external factors like lot area and proximity to various conditions significantly influence sale prices. Larger lots generally command higher prices, though other factors play crucial roles. Properties near negative externalities exhibit lower prices and greater variability, while those in desirable locales command higher prices. The interaction between house style and proximity further underscores these dynamics, revealing the nuanced impact of external features on real estate values.

  • Distribution of Sales Prices
ggplot(ames_housing, aes(x = SalePrice)) +
  geom_histogram(bins = 50, fill = "cornflowerblue", color = "black") +
  labs(title = "Distribution of Sale Prices",
       x = "Sale Price",
       y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

  • Comparing Lot Area with Sale Price

    The scatter plot shows lot area against sale price in the Ames dataset. Blue points represent properties, with transparency indicating data concentration. A red regression line suggests a positive trend, but wide point spread implies weak correlation. Light blue shading around the line represents the 95% confidence interval, highlighting considerable variability beyond lot area’s influence.

ggplot(ames_housing, aes(x = LotArea, y = SalePrice)) +
  geom_point(alpha = 0.6, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = TRUE, fill = "lightblue", alpha = 0.2) +
  labs(title = "Sale Price vs. Lot Area",
       x = "Lot Area (sq feet)",
       y = "Sale Price") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
## `geom_smooth()` using formula = 'y ~ x'
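
The visual impression of a weak linear relationship can be quantified. A Pearson correlation is sensitive to a few very large lots, whereas a rank-based Spearman correlation is not, so comparing the two shows how much the extremes drive the result. The sketch below uses hypothetical values (`lot_area` and `price` are made up) in which one lot is far larger than the rest.

```r
# Hypothetical lots: six ordinary ones and one extremely large one
lot_area <- c(5000, 7000, 8000, 9000, 10000, 12000, 200000)
price    <- c(120000, 150000, 140000, 180000, 200000, 210000, 220000)

# Pearson measures linear association and is pulled by the extreme lot
pearson <- cor(lot_area, price, method = "pearson")

# Spearman works on ranks, so the extreme lot counts only as "the largest"
spearman <- cor(lot_area, price, method = "spearman")

print(round(c(pearson = pearson, spearman = spearman), 3))
```

A large gap between the two coefficients suggests a monotone but non-linear relationship, in which case plotting lot area on a log scale may give a clearer picture.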

  • Density of Sale Prices by Major Roadway Proximity (Condition1)

    The plot illustrates sale price density in the Ames dataset, segmented by proximity to major roadways (Condition1). Each color represents a different condition, showcasing sale price distributions. Properties near undesirable features like major roads or railroads exhibit varied pricing, possibly indicating lower values due to negative factors. Conversely, normal or positively noted areas show narrower and higher-peaked distributions, reflecting higher median prices and less variance. This visual insight elucidates the impact of roadway proximity on property values.

ggplot(ames_housing, aes(x = SalePrice, fill = Condition1)) +
  geom_density(alpha = 0.7) +
  labs(title = "Density of Sale Prices by Major Roadway Proximity (Condition1)",
       x = "Sale Price",
       y = "Density") +
  scale_fill_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.position = "bottom",
    legend.title = element_blank(),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

  • Sale Prices by Condition1 and Condition2

    The plot shows sale prices in the Ames dataset by Condition1 and Condition2, which record proximity to features such as major roadways and railroads. Each panel corresponds to one Condition2 category, and within a panel the boxplots show the median, quartiles, and range of sale prices for each Condition1 category (for example ‘Artery’, ‘Feedr’, ‘Norm’). The ‘PosN’ category of Condition1 exhibits a higher median price and less variability than ‘Artery’ or ‘Feedr’, suggesting that proximity to major roadways lowers prices. The visualization shows how the two environmental conditions interact in their influence on sale prices.

colors <- brewer.pal(9, "Set1")

ggplot(ames_housing, aes(x = Condition1, y = SalePrice, fill = Condition1)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  facet_wrap(~ Condition2, scales = "free_x", nrow = 2, labeller = label_both) +
  labs(title = "Sale Prices by Condition1 and Condition2",
       x = "Condition1",
       y = "Sale Price") +
  scale_fill_manual(values = colors) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    strip.text = element_text(size = 12, face = "bold")
  )

  • Boxplot of Sale Prices by Proximity to Major Roadways (Condition1)

    The boxplot illustrates sale price distribution relative to proximity to major roadways and railroads (Condition1) in the Ames housing dataset. Categories like ‘Artery’ and ‘Feedr’ (adjacent to arterial and feeder streets) tend to have lower median sale prices and broader ranges, indicating variability in buyer valuation. Conversely, ‘PosN’ (near a positive off-site feature) and ‘RRNn’ (within 200 feet of the North-South railroad) exhibit higher median prices and narrower interquartile ranges, although narrow ranges can also arise simply because a category contains few sales. Outliers, especially in ‘Norm’ and ‘PosN’, may indicate unique features or better conditions. This visualization highlights how environmental factors impact real estate values, with less trafficked areas generally commanding higher prices.

ggplot(ames_housing, aes(x = Condition1, y = SalePrice)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3, fill = "lightblue", color = "black", alpha = 0.7) +
  labs(title = "Sale Prices by Proximity to Major Roadways (Condition1)",
       x = "Condition",
       y = "Sale Price") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
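
Boxplots do not show how many sales fall in each category, and a narrow box may simply reflect a handful of observations. A short sketch of a companion table with the count, median, and interquartile range per Condition1 category follows. The data frame `toy` and its prices are hypothetical, and `summary_tbl` is a name introduced here for illustration.

```r
# Toy stand-in for ames_housing (hypothetical prices)
toy <- data.frame(
  Condition1 = c("Norm", "Norm", "Norm", "Artery", "Artery", "Feedr", "PosN"),
  SalePrice  = c(150000, 170000, 190000, 110000, 130000, 140000, 230000)
)

# Per-category count, median and interquartile range
n_by_cond   <- tapply(toy$SalePrice, toy$Condition1, length)
med_by_cond <- tapply(toy$SalePrice, toy$Condition1, median)
iqr_by_cond <- tapply(toy$SalePrice, toy$Condition1, IQR)

summary_tbl <- data.frame(
  Condition1 = names(n_by_cond),
  n          = as.integer(n_by_cond),
  median     = as.numeric(med_by_cond),
  iqr        = as.numeric(iqr_by_cond)
)

# Highest median first
summary_tbl <- summary_tbl[order(-summary_tbl$median), ]
print(summary_tbl)
```

Categories with very small `n` are best read as anecdotes rather than as evidence of a price effect.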

  • Interaction of House Style, Condition 1 and Sale Price of House

    The plot shows sale prices by proximity condition (Condition1), colored by house style. Because Condition1 is categorical, the per-style lines fitted by geom_smooth(method = "lm") treat the categories as equally spaced positions in alphabetical order, so their slopes are a visual guide rather than a meaningful trend. Reading the points by category, ‘2.5Fin’ homes, for example, tend to sell for less near arterial streets but hold higher values in favorable conditions such as ‘PosN’, which points to an interaction between house style and location in real estate valuation.

ggplot(ames_housing, aes(x = Condition1, y = SalePrice, color = HouseStyle)) +
  geom_point(alpha = 0.6) +
  geom_smooth(aes(group = HouseStyle), method = "lm", se = FALSE) +
  labs(title = "Interaction of House Style and Condition1 on Sale Prices",
       x = "Condition1",
       y = "Sale Price") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
## `geom_smooth()` using formula = 'y ~ x'

  1. What effects do renovations have on the Sale Price of a house in Ames, Iowa?

    The analyses in this section give a mixed picture of how renovations relate to sale prices in Ames, Iowa. Taken as a whole, renovated homes (those with YearRemodAdd later than YearBuilt) do not sell for more: their median sale price is 155,000 against 170,000 for non-renovated homes, their mean is slightly lower (179,096 against 182,584), and a Welch t-test finds no significant difference in means (p = 0.41). Renovated homes also vary more in price (standard deviation 88,383 against 70,334), which may reflect differences in renovation extent and quality. The raw comparison is likely confounded by age and location, so the scatter plot by age at sale and the boxplots by neighborhood are more informative: they examine whether renovated homes hold their value better as they age and whether they command a premium within neighborhoods such as StoneBr, NridgHt, and NoRidge.

ames_housing$Renovated <- ifelse(ames_housing$YearRemodAdd > ames_housing$YearBuilt, "Renovated", "Not Renovated")

renovation_summary <- ames_housing %>%
  group_by(Renovated) %>%
  summarise(
    Count = n(),
    Mean = mean(SalePrice, na.rm = TRUE),
    Median = median(SalePrice, na.rm = TRUE),
    SD = sd(SalePrice, na.rm = TRUE)
  )
print(renovation_summary)
## # A tibble: 2 × 5
##   Renovated     Count    Mean Median     SD
##   <chr>         <int>   <dbl>  <dbl>  <dbl>
## 1 Not Renovated   764 182584. 170000 70334.
## 2 Renovated       696 179096. 155000 88383.
  • Density Plot of Sale Price by Renovation Status of House

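    This density plot compares sale price distributions between renovated and non-renovated homes in Ames. Consistent with the summary statistics above, the renovated group is centered on a lower median price (155,000 versus 170,000) and is more spread out. The Welch t-test that follows the plot finds no significant difference in mean sale price between the two groups (p = 0.41), so the plot on its own does not show a positive effect of renovations on property values.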

ggplot(ames_housing, aes(x = SalePrice, fill = Renovated)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Sale Prices by Renovation Status",
       x = "Sale Price",
       y = "Density",
       fill = "Renovated") +
  scale_fill_manual(values = c("lightblue", "orange")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

t_test_result <- t.test(SalePrice ~ Renovated, data = ames_housing)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  SalePrice by Renovated
## t = 0.82895, df = 1326.2, p-value = 0.4073
## alternative hypothesis: true difference in means between group Not Renovated and group Renovated is not equal to 0
## 95 percent confidence interval:
##  -4765.654 11740.359
## sample estimates:
## mean in group Not Renovated     mean in group Renovated 
##                    182583.7                    179096.3
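
The t-test result can be reproduced by hand from the group summaries printed earlier, which makes its interpretation more transparent. The sketch below recomputes the Welch statistic and the 95% confidence interval from the reported counts, means, and standard deviations, reusing the degrees of freedom from the t.test() output, and adds Cohen's d as a rough effect size. The effect size is an addition for illustration and is not part of the original analysis.

```r
# Summary statistics copied from the output above
n1 <- 764; m1 <- 182583.7; s1 <- 70334   # Not Renovated
n2 <- 696; m2 <- 179096.3; s2 <- 88383   # Renovated

# Welch t statistic: difference in means over the unequal-variance standard error
diff_means <- m1 - m2
se_welch   <- sqrt(s1^2 / n1 + s2^2 / n2)
t_stat     <- diff_means / se_welch

# 95% confidence interval, reusing the degrees of freedom reported by t.test()
df_welch <- 1326.2
ci <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se_welch

# Cohen's d with a pooled standard deviation, as a rough effect size
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
cohens_d <- diff_means / s_pooled

print(round(c(t = t_stat, lower = ci[1], upper = ci[2], d = cohens_d), 3))
```

The interval spans zero and the standardized difference is very small, which is why the test does not support a difference in mean sale price between the two groups.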
  • Sale Price by Age of House at Sale and Renovation Status

    The scatter plot illustrates the relationship between home age at sale and sale prices, categorized by renovation status. Blue dots represent non-renovated homes, while red dots depict renovated ones. Both show a decline in sale prices as home age increases, but the slope is steeper for non-renovated homes, indicating greater depreciation with age. Renovated homes maintain higher prices, especially for older properties, suggesting renovations mitigate age-related value declines. Variability in sale prices among renovated homes reflects differences in renovation extent and effectiveness in enhancing property value.

ames_housing$AgeAtSale <- ames_housing$YrSold - ames_housing$YearBuilt

ggplot(ames_housing, aes(x = AgeAtSale, y = SalePrice, color = Renovated)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1, aes(group = Renovated)) +
  labs(title = "Sale Price vs. Age at Sale by Renovation Status",
       x = "Age at Sale (Years)",
       y = "Sale Price") +
  scale_color_manual(values = c("Not Renovated" = "blue", "Renovated" = "red")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
## `geom_smooth()` using formula = 'y ~ x'
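
Whether the two regression lines really differ in slope can be checked with a linear model that includes an interaction between age and renovation status. The sketch below fits such a model on hypothetical data constructed so that non-renovated homes lose 2,000 per year and renovated homes lose 1,000 per year. The data frame `toy`, the prices, and the noise values are all made up for illustration.

```r
# Hypothetical data: same ages and same noise in both groups, different slopes
ages  <- c(5, 10, 20, 40, 60, 80)
noise <- c(4000, -4000, -4000, 4000, 4000, -4000)
toy <- data.frame(
  AgeAtSale = rep(ages, times = 2),
  Renovated = rep(c("Not Renovated", "Renovated"), each = 6),
  SalePrice = c(300000 - 2000 * ages + noise,
                300000 - 1000 * ages + noise)
)

# AgeAtSale * Renovated expands to both main effects plus their interaction
fit <- lm(SalePrice ~ AgeAtSale * Renovated, data = toy)

# The interaction coefficient is the difference in slope between the groups,
# with "Not Renovated" as the reference level
slope_gap <- coef(fit)[["AgeAtSale:RenovatedRenovated"]]
print(round(coef(fit), 1))
```

On the real data, the sign and standard error of the interaction term would indicate whether renovated homes depreciate more slowly with age or whether the difference in slopes is within noise.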

  • Sale Price of House by Neighborhood and Renovation Status

    The boxplot visualizes sale prices of homes by neighborhood and renovation status in the Ames dataset. For each neighborhood, two boxplots are displayed side by side—orange for renovated homes and blue for non-renovated. Renovated homes generally exhibit higher median sale prices across almost all neighborhoods, notably in areas like StoneBr, NridgHt, and NoRidge, indicating a premium for upgrades in these locales. Additionally, the broader range of prices for renovated homes in several neighborhoods reflects varying impacts of renovations on property values, emphasizing the influence of neighborhood context on the return on investment in home upgrades.

ggplot(ames_housing, aes(x = Neighborhood, y = SalePrice, fill = Renovated)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  coord_flip() +
  labs(title = "Sale Price by Neighborhood and Renovation Status",
       x = "Neighborhood",
       y = "Sale Price",
       fill = "Renovated") +
  scale_fill_manual(values = c("Not Renovated" = "lightblue", "Renovated" = "orange")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

  • Impact of Renovations on Sale Price of the House

    This boxplot compares sale prices between renovated (orange) and non-renovated (blue) homes, with red diamonds as summary markers. The summary statistics above show a lower median for renovated homes (155,000 versus 170,000) and a larger standard deviation (88,383 versus 70,334), so the broader interquartile range in the renovated group reflects greater variability, possibly due to differences in renovation quality, rather than a higher typical price. The code below first computes the correlation matrix of the numeric features for renovated homes only.

renovated_data <- ames_housing %>% filter(Renovated == "Renovated")

cor_matrix_renovated <- cor(renovated_data[which(sapply(renovated_data, is.numeric))], use = "complete.obs")
print(cor_matrix_renovated)
##                          Id   MSSubClass      LotArea  OverallQual  OverallCond
## Id             1.0000000000 -0.006639610 -0.022016621 -0.060289631  0.037909806
## MSSubClass    -0.0066396104  1.000000000 -0.125602271  0.062517468 -0.032679718
## LotArea       -0.0220166213 -0.125602271  1.000000000  0.167084313 -0.009618699
## OverallQual   -0.0602896311  0.062517468  0.167084313  1.000000000 -0.053667383
## OverallCond    0.0379098063 -0.032679718 -0.009618699 -0.053667383  1.000000000
## YearBuilt     -0.0399568478 -0.073834531  0.075330772  0.547681042 -0.319710604
## YearRemodAdd  -0.0640346042 -0.028065272  0.095042539  0.425911931  0.173148727
## MasVnrArea    -0.0488956115 -0.009192766  0.207099473  0.479914944 -0.145184992
## BsmtFinSF1    -0.0391894319 -0.108023622  0.233642972  0.356704165 -0.057463383
## BsmtFinSF2    -0.0178375056 -0.105310750  0.060086317 -0.028052039  0.043679262
## BsmtUnfSF      0.0081393797 -0.037705412 -0.009724210  0.241439802 -0.164626963
## TotalBsmtSF   -0.0396371046 -0.188452316  0.255173397  0.586746369 -0.198915206
## X1stFlrSF      0.0203323106 -0.176311812  0.334290895  0.535718231 -0.164060971
## X2ndFlrSF     -0.0007198465  0.380722483  0.090442641  0.296775715  0.035730682
## LowQualFinSF  -0.0671622956  0.073312677  0.011154863 -0.031385404 -0.012965561
## GrLivArea      0.0054387410  0.180230372  0.306014218  0.601955841 -0.089043685
## BsmtFullBath  -0.0248979455 -0.064913692  0.128148279  0.137280006 -0.079082676
## BsmtHalfBath  -0.0278538562  0.010634871  0.083906847 -0.008352248  0.147819316
## FullBath      -0.0083031047  0.199835903  0.167773990  0.537519876 -0.187917809
## HalfBath      -0.0175459828  0.131287802  0.052171698  0.301497106 -0.033796140
## BedroomAbvGr   0.0296183799  0.145950710  0.128185591  0.202243762  0.030732949
## KitchenAbvGr  -0.0234018748  0.441858582 -0.031203353 -0.136340432 -0.066524785
## TotRmsAbvGrd   0.0050853618  0.199538860  0.197419943  0.455275344 -0.074249810
## Fireplaces     0.0084739046 -0.025726795  0.303203671  0.445477486 -0.046408969
## GarageYrBlt    0.0043305595 -0.036827986  0.063282117  0.502061406 -0.236380028
## GarageCars    -0.0001871438 -0.030437865  0.224106130  0.608216689 -0.169377168
## GarageArea     0.0103609659 -0.074054376  0.228741545  0.556039810 -0.135474973
## WoodDeckSF    -0.0457744475 -0.036648791  0.204054218  0.269674224 -0.012192706
## OpenPorchSF   -0.0120882679 -0.006616687  0.110835032  0.261312933 -0.024727444
## EnclosedPorch  0.0061911531  0.047757220 -0.036208670 -0.143470852  0.031693529
## X3SsnPorch    -0.0447543072 -0.064113377  0.041591085 -0.001150516  0.065915198
## ScreenPorch    0.0101420678  0.026662815  0.028175899  0.075581831  0.087323870
## PoolArea      -0.0237825066 -0.013957198  0.037804421  0.032688407 -0.032863505
## MiscVal       -0.0330447108 -0.017313856  0.024315317 -0.018628474  0.071427779
## MoSold         0.0209367886 -0.022297585 -0.002003056  0.056159875  0.032781713
## YrSold         0.0413814441 -0.034270415 -0.044487296 -0.046477145  0.075629822
## SalePrice     -0.0351673374 -0.044642350  0.311834688  0.791809766 -0.076108324
## AgeAtSale      0.0415002160  0.072510112 -0.076975968 -0.549196907  0.322424570
##                   YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1
## Id            -0.0399568478 -0.0640346042 -0.048895611 -0.039189432
## MSSubClass    -0.0738345309 -0.0280652715 -0.009192766 -0.108023622
## LotArea        0.0753307718  0.0950425391  0.207099473  0.233642972
## OverallQual    0.5476810417  0.4259119312  0.479914944  0.356704165
## OverallCond   -0.3197106041  0.1731487275 -0.145184992 -0.057463383
## YearBuilt      1.0000000000  0.5579918455  0.417227574  0.392280191
## YearRemodAdd   0.5579918455  1.0000000000  0.240193970  0.277006123
## MasVnrArea     0.4172275739  0.2401939702  1.000000000  0.375127090
## BsmtFinSF1     0.3922801914  0.2770061229  0.375127090  1.000000000
## BsmtFinSF2     0.0102314962  0.0413587001 -0.059037727 -0.048888982
## BsmtUnfSF      0.1048564929  0.0456373270  0.066218839 -0.468564107
## TotalBsmtSF    0.5083688997  0.3452779580  0.426826173  0.566093088
## X1stFlrSF      0.4113056771  0.3399610634  0.400193402  0.504982181
## X2ndFlrSF     -0.0933264713 -0.0389362029  0.229647226 -0.128738855
## LowQualFinSF  -0.1749403311 -0.1028627429 -0.079261050 -0.073077025
## GrLivArea      0.1947836187  0.1957290692  0.448958254  0.245631540
## BsmtFullBath   0.2788631184  0.2094954274  0.136274807  0.649662154
## BsmtHalfBath   0.0195111209  0.0589708784  0.037829061  0.079159252
## FullBath       0.4376355807  0.3232159118  0.354460770  0.168849004
## HalfBath       0.2508164621  0.1716973924  0.260020168  0.049209772
## BedroomAbvGr  -0.0694496337  0.0003271869  0.109613808 -0.094221309
## KitchenAbvGr  -0.2109171381 -0.1303134932 -0.036974643 -0.061503305
## TotRmsAbvGrd   0.0805454600  0.1307529209  0.309220169  0.073172683
## Fireplaces     0.2428563179  0.1104378714  0.268966858  0.285069181
## GarageYrBlt    0.7808115142  0.5122964118  0.355357650  0.294241975
## GarageCars     0.5472277342  0.3703407757  0.434128981  0.333575437
## GarageArea     0.4898972075  0.3392711943  0.441508799  0.375789219
## WoodDeckSF     0.2802516778  0.2469665673  0.233667116  0.239887490
## OpenPorchSF    0.1484619547  0.1472928602  0.099611110  0.145155194
## EnclosedPorch -0.4225391910 -0.2415215831 -0.158127597 -0.143636712
## X3SsnPorch     0.0525336628  0.0692085136  0.001692652  0.021099719
## ScreenPorch   -0.0238284556  0.0120623637  0.038251562  0.035624870
## PoolArea      -0.0136803012  0.0200061646 -0.008463973  0.051913386
## MiscVal       -0.0167790183  0.0049087618 -0.025889577  0.004564695
## MoSold        -0.0001492084  0.0222041172 -0.015331347 -0.021723756
## YrSold         0.0073899790  0.0931027149 -0.016194832  0.045788190
## SalePrice      0.5613064396  0.4457476914  0.592684037  0.500689216
## AgeAtSale     -0.9992886157 -0.5542391451 -0.417657819 -0.390383608
##                 BsmtFinSF2     BsmtUnfSF  TotalBsmtSF    X1stFlrSF
## Id            -0.017837506  0.0081393797 -0.039637105  0.020332311
## MSSubClass    -0.105310750 -0.0377054121 -0.188452316 -0.176311812
## LotArea        0.060086317 -0.0097242103  0.255173397  0.334290895
## OverallQual   -0.028052039  0.2414398019  0.586746369  0.535718231
## OverallCond    0.043679262 -0.1646269635 -0.198915206 -0.164060971
## YearBuilt      0.010231496  0.1048564929  0.508368900  0.411305677
## YearRemodAdd   0.041358700  0.0456373270  0.345277958  0.339961063
## MasVnrArea    -0.059037727  0.0662188393  0.426826173  0.400193402
## BsmtFinSF1    -0.048888982 -0.4685641072  0.566093088  0.504982181
## BsmtFinSF2     1.000000000 -0.2191904765  0.131710707  0.123333770
## BsmtUnfSF     -0.219190477  1.0000000000  0.383122796  0.259588648
## TotalBsmtSF    0.131710707  0.3831227958  1.000000000  0.816100076
## X1stFlrSF      0.123333770  0.2595886484  0.816100076  1.000000000
## X2ndFlrSF     -0.104029730  0.0529425716 -0.123017972 -0.110154595
## LowQualFinSF   0.024496444  0.0386635311 -0.028981451 -0.008517507
## GrLivArea      0.008977191  0.2279325569  0.473812433  0.615344047
## BsmtFullBath   0.184316599 -0.4020311866  0.359275189  0.304219937
## BsmtHalfBath   0.022968436 -0.0995440528 -0.004155133 -0.028073202
## FullBath      -0.060610594  0.2539540931  0.392206552  0.459400155
## HalfBath      -0.020889658 -0.0030778592  0.039654873  0.007522238
## BedroomAbvGr  -0.043715442  0.1937302335  0.070179295  0.144486744
## KitchenAbvGr  -0.048287367  0.0189400828 -0.064266234 -0.002570116
## TotRmsAbvGrd  -0.051956262  0.2527588324  0.295782531  0.440271716
## Fireplaces     0.058216119  0.0561520129  0.370190960  0.460004164
## GarageYrBlt   -0.001802506  0.1290552646  0.425599256  0.383168193
## GarageCars    -0.003473848  0.1860054622  0.519726067  0.522270304
## GarageArea     0.007852866  0.1455919554  0.529212774  0.543036385
## WoodDeckSF     0.101763659 -0.0207645230  0.267390229  0.306111531
## OpenPorchSF    0.013583386  0.0967038544  0.247064638  0.221026630
## EnclosedPorch -0.032742899  0.0177872719 -0.143990779 -0.131021455
## X3SsnPorch    -0.027151777  0.0002356145  0.011372989  0.052785139
## ScreenPorch    0.065055835  0.0177230130  0.079034840  0.093181436
## PoolArea       0.078378261 -0.0673267824  0.020071808  0.023841599
## MiscVal       -0.017553525 -0.0287515737 -0.029523891 -0.037508446
## MoSold        -0.059112690  0.0337958583 -0.013330567  0.001220686
## YrSold         0.044933717 -0.0838250855 -0.015021335 -0.029749520
## SalePrice      0.012493079  0.1805041689  0.693068292  0.685510964
## AgeAtSale     -0.008532442 -0.1079724968 -0.508715453 -0.412249685
##                   X2ndFlrSF LowQualFinSF    GrLivArea BsmtFullBath
## Id            -0.0007198465 -0.067162296  0.005438741 -0.024897946
## MSSubClass     0.3807224830  0.073312677  0.180230372 -0.064913692
## LotArea        0.0904426411  0.011154863  0.306014218  0.128148279
## OverallQual    0.2967757148 -0.031385404  0.601955841  0.137280006
## OverallCond    0.0357306819 -0.012965561 -0.089043685 -0.079082676
## YearBuilt     -0.0933264713 -0.174940331  0.194783619  0.278863118
## YearRemodAdd  -0.0389362029 -0.102862743  0.195729069  0.209495427
## MasVnrArea     0.2296472264 -0.079261050  0.448958254  0.136274807
## BsmtFinSF1    -0.1287388550 -0.073077025  0.245631540  0.649662154
## BsmtFinSF2    -0.1040297297  0.024496444  0.008977191  0.184316599
## BsmtUnfSF      0.0529425716  0.038663531  0.227932557 -0.402031187
## TotalBsmtSF   -0.1230179723 -0.028981451  0.473812433  0.359275189
## X1stFlrSF     -0.1101545952 -0.008517507  0.615344047  0.304219937
## X2ndFlrSF      1.0000000000  0.071149984  0.706106697 -0.189675239
## LowQualFinSF   0.0711499844  1.000000000  0.172292691 -0.054789653
## GrLivArea      0.7061066972  0.172292691  1.000000000  0.059800225
## BsmtFullBath  -0.1896752391 -0.054789653  0.059800225  1.000000000
## BsmtHalfBath  -0.0246742601 -0.013497582 -0.040475630 -0.136195635
## FullBath       0.4117367621  0.008561864  0.642375016  0.019440640
## HalfBath       0.5373903329 -0.030175941  0.417847566 -0.011698472
## BedroomAbvGr   0.5785577967  0.151401807  0.568210825 -0.099019938
## KitchenAbvGr   0.1406924830  0.011127107  0.108553193 -0.049122088
## TotRmsAbvGrd   0.6527557029  0.179144977  0.836663589 -0.035065599
## Fireplaces     0.2046079923 -0.035077316  0.476973811  0.123015140
## GarageYrBlt   -0.0302644656 -0.089984746  0.234351994  0.201751509
## GarageCars     0.1473055987 -0.104511981  0.467727413  0.165608043
## GarageArea     0.1067700285 -0.068292563  0.455359450  0.202026100
## WoodDeckSF     0.0459586009 -0.026545423  0.247137941  0.168518634
## OpenPorchSF    0.1793530148  0.017376421  0.296173980  0.066650395
## EnclosedPorch  0.0955708126  0.050490384 -0.011689540 -0.110295667
## X3SsnPorch    -0.0312375867 -0.007563255  0.011911167  0.003572285
## ScreenPorch    0.0689403622  0.038248170  0.123502130 -0.017347176
## PoolArea       0.0251401179  0.123236596  0.051381868  0.063381146
## MiscVal       -0.0241323959 -0.008031946 -0.046003696 -0.033475603
## MoSold         0.0578180614 -0.029448701  0.042018183 -0.064977973
## YrSold        -0.0751744197 -0.040035047 -0.084039695  0.112956054
## SalePrice      0.3034291010 -0.028278444  0.712605540  0.259817212
## AgeAtSale      0.0904509684  0.173354759 -0.197868807 -0.274482444
##                BsmtHalfBath     FullBath     HalfBath  BedroomAbvGr
## Id            -0.0278538562 -0.008303105 -0.017545983  0.0296183799
## MSSubClass     0.0106348715  0.199835903  0.131287802  0.1459507099
## LotArea        0.0839068465  0.167773990  0.052171698  0.1281855914
## OverallQual   -0.0083522477  0.537519876  0.301497106  0.2022437617
## OverallCond    0.1478193160 -0.187917809 -0.033796140  0.0307329488
## YearBuilt      0.0195111209  0.437635581  0.250816462 -0.0694496337
## YearRemodAdd   0.0589708784  0.323215912  0.171697392  0.0003271869
## MasVnrArea     0.0378290608  0.354460770  0.260020168  0.1096138077
## BsmtFinSF1     0.0791592516  0.168849004  0.049209772 -0.0942213086
## BsmtFinSF2     0.0229684363 -0.060610594 -0.020889658 -0.0437154419
## BsmtUnfSF     -0.0995440528  0.253954093 -0.003077859  0.1937302335
## TotalBsmtSF   -0.0041551333  0.392206552  0.039654873  0.0701792954
## X1stFlrSF     -0.0280732022  0.459400155  0.007522238  0.1444867442
## X2ndFlrSF     -0.0246742601  0.411736762  0.537390333  0.5785577967
## LowQualFinSF  -0.0134975818  0.008561864 -0.030175941  0.1514018069
## GrLivArea     -0.0404756297  0.642375016  0.417847566  0.5682108253
## BsmtFullBath  -0.1361956351  0.019440640 -0.011698472 -0.0990199381
## BsmtHalfBath   1.0000000000 -0.091267748 -0.007769430  0.0197355422
## FullBath      -0.0912677477  1.000000000  0.130732330  0.3927814312
## HalfBath      -0.0077694301  0.130732330  1.000000000  0.2159327616
## BedroomAbvGr   0.0197355422  0.392781431  0.215932762  1.0000000000
## KitchenAbvGr  -0.0304846954  0.174014882 -0.106025118  0.1507613694
## TotRmsAbvGrd  -0.0560859661  0.574810859  0.324532042  0.6775038509
## Fireplaces     0.0201455382  0.282806924  0.232194638  0.1555681897
## GarageYrBlt   -0.0066238021  0.411006686  0.208177429 -0.0545290540
## GarageCars     0.0004902256  0.470736309  0.238102447  0.0913750111
## GarageArea     0.0054796182  0.418959010  0.180775474  0.0561159618
## WoodDeckSF     0.0354795574  0.231975787  0.083135535  0.0294211851
## OpenPorchSF   -0.0251618573  0.220458505  0.184205108  0.1115820634
## EnclosedPorch -0.0408969773 -0.126014065 -0.117835501  0.0521008419
## X3SsnPorch     0.0654300393  0.016033824  0.037521372 -0.0090476342
## ScreenPorch   -0.0048850467  0.016956458  0.080437495  0.0819139077
## PoolArea       0.0776798339 -0.007244223  0.025000048  0.0355388285
## MiscVal       -0.0111828857 -0.026956401 -0.055055315 -0.0223507184
## MoSold         0.0155478955  0.042139011  0.008944542  0.0576824297
## YrSold        -0.0589132695 -0.022510252 -0.048204444 -0.0643252051
## SalePrice     -0.0019700863  0.566740338  0.339805234  0.2160076068
## AgeAtSale     -0.0217245295 -0.438295175 -0.252525918  0.0669936283
##               KitchenAbvGr TotRmsAbvGrd   Fireplaces  GarageYrBlt    GarageCars
## Id            -0.023401875  0.005085362  0.008473905  0.004330559 -0.0001871438
## MSSubClass     0.441858582  0.199538860 -0.025726795 -0.036827986 -0.0304378654
## LotArea       -0.031203353  0.197419943  0.303203671  0.063282117  0.2241061296
## OverallQual   -0.136340432  0.455275344  0.445477486  0.502061406  0.6082166888
## OverallCond   -0.066524785 -0.074249810 -0.046408969 -0.236380028 -0.1693771680
## YearBuilt     -0.210917138  0.080545460  0.242856318  0.780811514  0.5472277342
## YearRemodAdd  -0.130313493  0.130752921  0.110437871  0.512296412  0.3703407757
## MasVnrArea    -0.036974643  0.309220169  0.268966858  0.355357650  0.4341289806
## BsmtFinSF1    -0.061503305  0.073172683  0.285069181  0.294241975  0.3335754373
## BsmtFinSF2    -0.048287367 -0.051956262  0.058216119 -0.001802506 -0.0034738476
## BsmtUnfSF      0.018940083  0.252758832  0.056152013  0.129055265  0.1860054622
## TotalBsmtSF   -0.064266234  0.295782531  0.370190960  0.425599256  0.5197260666
## X1stFlrSF     -0.002570116  0.440271716  0.460004164  0.383168193  0.5222703035
## X2ndFlrSF      0.140692483  0.652755703  0.204607992 -0.030264466  0.1473055987
## LowQualFinSF   0.011127107  0.179144977 -0.035077316 -0.089984746 -0.1045119807
## GrLivArea      0.108553193  0.836663589  0.476973811  0.234351994  0.4677274135
## BsmtFullBath  -0.049122088 -0.035065599  0.123015140  0.201751509  0.1656080428
## BsmtHalfBath  -0.030484695 -0.056085966  0.020145538 -0.006623802  0.0004902256
## FullBath       0.174014882  0.574810859  0.282806924  0.411006686  0.4707363087
## HalfBath      -0.106025118  0.324532042  0.232194638  0.208177429  0.2381024469
## BedroomAbvGr   0.150761369  0.677503851  0.155568190 -0.054529054  0.0913750111
## KitchenAbvGr   1.000000000  0.240166904 -0.099630056 -0.166511210 -0.0365934148
## TotRmsAbvGrd   0.240166904  1.000000000  0.351792126  0.134455453  0.3550874341
## Fireplaces    -0.099630056  0.351792126  1.000000000  0.161075101  0.3265758686
## GarageYrBlt   -0.166511210  0.134455453  0.161075101  1.000000000  0.6653460368
## GarageCars    -0.036593415  0.355087434  0.326575869  0.665346037  1.0000000000
## GarageArea    -0.048666291  0.325033861  0.282556271  0.667554488  0.8962457653
## WoodDeckSF    -0.084502194  0.154060230  0.205871356  0.306703619  0.2952581473
## OpenPorchSF   -0.032090697  0.229823575  0.158030643  0.141370803  0.1617362530
## EnclosedPorch  0.086887819  0.002858394 -0.105589401 -0.324049169 -0.1605543387
## X3SsnPorch    -0.027910613  0.002951087 -0.027536537  0.050147073  0.0349356013
## ScreenPorch   -0.048987716  0.060826861  0.208893585 -0.009249931  0.0436412649
## PoolArea      -0.011203489 -0.009867632  0.027795577 -0.032532082  0.0222191055
## MiscVal        0.015862258 -0.032963307 -0.023866778 -0.016535865 -0.0540296818
## MoSold         0.020164490  0.033196699  0.008965144 -0.007270497  0.0181558259
## YrSold        -0.009185670 -0.079425737 -0.027513015  0.002061593 -0.0482939056
## SalePrice     -0.118016676  0.541576554  0.520558284  0.512555981  0.6524512682
## AgeAtSale      0.210479451 -0.083506066 -0.243788863 -0.780395923 -0.5488123130
##                 GarageArea  WoodDeckSF  OpenPorchSF EnclosedPorch    X3SsnPorch
## Id             0.010360966 -0.04577445 -0.012088268   0.006191153 -0.0447543072
## MSSubClass    -0.074054376 -0.03664879 -0.006616687   0.047757220 -0.0641133767
## LotArea        0.228741545  0.20405422  0.110835032  -0.036208670  0.0415910854
## OverallQual    0.556039810  0.26967422  0.261312933  -0.143470852 -0.0011505160
## OverallCond   -0.135474973 -0.01219271 -0.024727444   0.031693529  0.0659151977
## YearBuilt      0.489897207  0.28025168  0.148461955  -0.422539191  0.0525336628
## YearRemodAdd   0.339271194  0.24696657  0.147292860  -0.241521583  0.0692085136
## MasVnrArea     0.441508799  0.23366712  0.099611110  -0.158127597  0.0016926525
## BsmtFinSF1     0.375789219  0.23988749  0.145155194  -0.143636712  0.0210997190
## BsmtFinSF2     0.007852866  0.10176366  0.013583386  -0.032742899 -0.0271517770
## BsmtUnfSF      0.145591955 -0.02076452  0.096703854   0.017787272  0.0002356145
## TotalBsmtSF    0.529212774  0.26739023  0.247064638  -0.143990779  0.0113729885
## X1stFlrSF      0.543036385  0.30611153  0.221026630  -0.131021455  0.0527851393
## X2ndFlrSF      0.106770028  0.04595860  0.179353015   0.095570813 -0.0312375867
## LowQualFinSF  -0.068292563 -0.02654542  0.017376421   0.050490384 -0.0075632547
## GrLivArea      0.455359450  0.24713794  0.296173980  -0.011689540  0.0119111671
## BsmtFullBath   0.202026100  0.16851863  0.066650395  -0.110295667  0.0035722852
## BsmtHalfBath   0.005479618  0.03547956 -0.025161857  -0.040896977  0.0654300393
## FullBath       0.418959010  0.23197579  0.220458505  -0.126014065  0.0160338236
## HalfBath       0.180775474  0.08313553  0.184205108  -0.117835501  0.0375213719
## BedroomAbvGr   0.056115962  0.02942119  0.111582063   0.052100842 -0.0090476342
## KitchenAbvGr  -0.048666291 -0.08450219 -0.032090697   0.086887819 -0.0279106127
## TotRmsAbvGrd   0.325033861  0.15406023  0.229823575   0.002858394  0.0029510874
## Fireplaces     0.282556271  0.20587136  0.158030643  -0.105589401 -0.0275365373
## GarageYrBlt    0.667554488  0.30670362  0.141370803  -0.324049169  0.0501470733
## GarageCars     0.896245765  0.29525815  0.161736253  -0.160554339  0.0349356013
## GarageArea     1.000000000  0.29420256  0.192798635  -0.134246874  0.0446375168
## WoodDeckSF     0.294202564  1.00000000  0.030751748  -0.163649643  0.0112990601
## OpenPorchSF    0.192798635  0.03075175  1.000000000  -0.144953640 -0.0029524114
## EnclosedPorch -0.134246874 -0.16364964 -0.144953640   1.000000000 -0.0610071747
## X3SsnPorch     0.044637517  0.01129906 -0.002952411  -0.061007175  1.0000000000
## ScreenPorch    0.047815348 -0.09022230  0.120883120  -0.106555424 -0.0381667912
## PoolArea       0.039392620  0.02535036 -0.029246218   0.155303012 -0.0070817515
## MiscVal       -0.036670177 -0.02801053 -0.020976509  -0.024786225  0.0062928731
## MoSold         0.008793856  0.03438502  0.105934597  -0.086582786  0.0543260922
## YrSold        -0.034170745  0.04260934 -0.042470684  -0.033169470  0.0061757417
## SalePrice      0.633305006  0.34838218  0.272704102  -0.171166656  0.0220926009
## AgeAtSale     -0.490973952 -0.27852345 -0.149999455   0.421105416 -0.0522780211
##                ScreenPorch     PoolArea      MiscVal        MoSold       YrSold
## Id             0.010142068 -0.023782507 -0.033044711  0.0209367886  0.041381444
## MSSubClass     0.026662815 -0.013957198 -0.017313856 -0.0222975847 -0.034270415
## LotArea        0.028175899  0.037804421  0.024315317 -0.0020030558 -0.044487296
## OverallQual    0.075581831  0.032688407 -0.018628474  0.0561598755 -0.046477145
## OverallCond    0.087323870 -0.032863505  0.071427779  0.0327817131  0.075629822
## YearBuilt     -0.023828456 -0.013680301 -0.016779018 -0.0001492084  0.007389979
## YearRemodAdd   0.012062364  0.020006165  0.004908762  0.0222041172  0.093102715
## MasVnrArea     0.038251562 -0.008463973 -0.025889577 -0.0153313469 -0.016194832
## BsmtFinSF1     0.035624870  0.051913386  0.004564695 -0.0217237560  0.045788190
## BsmtFinSF2     0.065055835  0.078378261 -0.017553525 -0.0591126904  0.044933717
## BsmtUnfSF      0.017723013 -0.067326782 -0.028751574  0.0337958583 -0.083825086
## TotalBsmtSF    0.079034840  0.020071808 -0.029523891 -0.0133305673 -0.015021335
## X1stFlrSF      0.093181436  0.023841599 -0.037508446  0.0012206862 -0.029749520
## X2ndFlrSF      0.068940362  0.025140118 -0.024132396  0.0578180614 -0.075174420
## LowQualFinSF   0.038248170  0.123236596 -0.008031946 -0.0294487014 -0.040035047
## GrLivArea      0.123502130  0.051381868 -0.046003696  0.0420181834 -0.084039695
## BsmtFullBath  -0.017347176  0.063381146 -0.033475603 -0.0649779728  0.112956054
## BsmtHalfBath  -0.004885047  0.077679834 -0.011182886  0.0155478955 -0.058913270
## FullBath       0.016956458 -0.007244223 -0.026956401  0.0421390114 -0.022510252
## HalfBath       0.080437495  0.025000048 -0.055055315  0.0089445418 -0.048204444
## BedroomAbvGr   0.081913908  0.035538829 -0.022350718  0.0576824297 -0.064325205
## KitchenAbvGr  -0.048987716 -0.011203489  0.015862258  0.0201644900 -0.009185670
## TotRmsAbvGrd   0.060826861 -0.009867632 -0.032963307  0.0331966989 -0.079425737
## Fireplaces     0.208893585  0.027795577 -0.023866778  0.0089651440 -0.027513015
## GarageYrBlt   -0.009249931 -0.032532082 -0.016535865 -0.0072704967  0.002061593
## GarageCars     0.043641265  0.022219105 -0.054029682  0.0181558259 -0.048293906
## GarageArea     0.047815348  0.039392620 -0.036670177  0.0087938565 -0.034170745
## WoodDeckSF    -0.090222304  0.025350365 -0.028010525  0.0343850193  0.042609338
## OpenPorchSF    0.120883120 -0.029246218 -0.020976509  0.1059345966 -0.042470684
## EnclosedPorch -0.106555424  0.155303012 -0.024786225 -0.0865827864 -0.033169470
## X3SsnPorch    -0.038166791 -0.007081751  0.006292873  0.0543260922  0.006175742
## ScreenPorch    1.000000000 -0.015320381  0.013011765  0.0030214365 -0.001870863
## PoolArea      -0.015320381  1.000000000 -0.004921803 -0.0878305931 -0.075501117
## MiscVal        0.013011765 -0.004921803  1.000000000 -0.0271438064  0.014712001
## MoSold         0.003021436 -0.087830593 -0.027143806  1.0000000000 -0.118265069
## YrSold        -0.001870863 -0.075501117  0.014712001 -0.1182650692  1.000000000
## SalePrice      0.123848829  0.015537571 -0.025999230 -0.0184071939 -0.043878232
## AgeAtSale      0.023747588  0.010826938  0.017326606 -0.0043110962  0.030327145
##                  SalePrice    AgeAtSale
## Id            -0.035167337  0.041500216
## MSSubClass    -0.044642350  0.072510112
## LotArea        0.311834688 -0.076975968
## OverallQual    0.791809766 -0.549196907
## OverallCond   -0.076108324  0.322424570
## YearBuilt      0.561306440 -0.999288616
## YearRemodAdd   0.445747691 -0.554239145
## MasVnrArea     0.592684037 -0.417657819
## BsmtFinSF1     0.500689216 -0.390383608
## BsmtFinSF2     0.012493079 -0.008532442
## BsmtUnfSF      0.180504169 -0.107972497
## TotalBsmtSF    0.693068292 -0.508715453
## X1stFlrSF      0.685510964 -0.412249685
## X2ndFlrSF      0.303429101  0.090450968
## LowQualFinSF  -0.028278444  0.173354759
## GrLivArea      0.712605540 -0.197868807
## BsmtFullBath   0.259817212 -0.274482444
## BsmtHalfBath  -0.001970086 -0.021724530
## FullBath       0.566740338 -0.438295175
## HalfBath       0.339805234 -0.252525918
## BedroomAbvGr   0.216007607  0.066993628
## KitchenAbvGr  -0.118016676  0.210479451
## TotRmsAbvGrd   0.541576554 -0.083506066
## Fireplaces     0.520558284 -0.243788863
## GarageYrBlt    0.512555981 -0.780395923
## GarageCars     0.652451268 -0.548812313
## GarageArea     0.633305006 -0.490973952
## WoodDeckSF     0.348382180 -0.278523453
## OpenPorchSF    0.272704102 -0.149999455
## EnclosedPorch -0.171166656  0.421105416
## X3SsnPorch     0.022092601 -0.052278021
## ScreenPorch    0.123848829  0.023747588
## PoolArea       0.015537571  0.010826938
## MiscVal       -0.025999230  0.017326606
## MoSold        -0.018407194 -0.004311096
## YrSold        -0.043878232  0.030327145
## SalePrice      1.000000000 -0.562718394
## AgeAtSale     -0.562718394  1.000000000
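
The printed matrix is hard to scan for the relationships that matter most. In the SalePrice column above, the strongest correlations are OverallQual (0.79), GrLivArea (0.71), TotalBsmtSF (0.69), and X1stFlrSF (0.69), while AgeAtSale is almost perfectly collinear with YearBuilt (-0.999). A minimal sketch of pulling such a ranking out programmatically follows. The toy data frame and its values are made up for illustration; in the report the same steps would presumably be applied to the numeric columns of ames_housing.

```r
# Toy stand-in for the numeric columns of ames_housing (values are made up).
set.seed(42)
n <- 200
toy <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea   = round(rnorm(n, mean = 1500, sd = 400)),
  MoSold      = sample(1:12, n, replace = TRUE)
)
toy$SalePrice <- 20000 * toy$OverallQual + 60 * toy$GrLivArea + rnorm(n, sd = 20000)

# Full correlation matrix, computed the same way as a printed cor() matrix.
cor_matrix <- cor(toy, use = "pairwise.complete.obs")

# Keep only the SalePrice column, drop its self-correlation, sort by |r|.
sale_cor <- cor_matrix[, "SalePrice"]
sale_cor <- sale_cor[names(sale_cor) != "SalePrice"]
ranked <- sale_cor[order(abs(sale_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps strong negative predictors, such as AgeAtSale in the real matrix, near the top instead of at the bottom.
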
ggplot(ames_housing, aes(x = Renovated, y = SalePrice, fill = Renovated)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red") +
  labs(title = "Impact of Renovations on Sale Prices",
       x = "Renovation Status",
       y = "Sale Price") +
  scale_fill_manual(values = c("Not Renovated" = "#1f77b4", "Renovated" = "#ff7f0e")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

  1. How do energy efficiency and utilities affect the sale price of a house?

    The analysis shows a clear association between heating quality and sale price. Homes with excellent heating average about $215,000, compared with roughly $157,000 for good, $142,000 for average, and $123,000 for fair or poor heating. The effect of utilities is harder to assess: homes without sewer or water (NoSeWa) are rare, so nearly every sale has all public utilities and any utilities comparison rests on very few observations. The sale price distributions show the same pattern, highlighting the value of a well-rated heating system and of access to basic utility services.

  • Sale Price of House by Utility and Heating Quality

    The chart shows sale prices by utility type, with one panel per heating quality rating. Homes with excellent heating reach the highest prices, while homes with fair or poor heating sell for noticeably less. Because nearly all homes have all public utilities, the panels mainly reflect differences in heating quality rather than in utility access. The panels use free y-axis scales, so box positions should be compared by their axis values rather than by eye.

ggplot(ames_housing, aes(x = Utilities, y = SalePrice, fill = HeatingQC)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  facet_wrap(~ HeatingQC, scales = "free_y") +
  labs(title = "Sale Prices by Utility Type and Heating Quality",
       x = "Utilities", y = "Sale Price") +
  scale_fill_viridis_d(option = "inferno") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
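
HeatingQC is stored as text, so panels and tables appear in alphabetical order (Ex, Fa, Gd, TA) rather than in quality order. A minimal sketch of one way to impose the quality order before plotting is shown below; the toy data frame is made up, and the best-to-worst ordering Ex, Gd, TA, Fa, Po is an assumption based on the rating labels used in this report.

```r
# Made-up rows standing in for ames_housing.
toy <- data.frame(
  HeatingQC = c("TA", "Ex", "Fa", "Gd", "Ex", "Po", "TA"),
  SalePrice = c(135000, 210000, 120000, 155000, 250000, 87000, 140000)
)

# Assumed quality order, from best to worst.
quality_order <- c("Ex", "Gd", "TA", "Fa", "Po")
toy$HeatingQC <- factor(toy$HeatingQC, levels = quality_order)

# Levels now follow quality rather than alphabetical order,
# so facets and summaries come out best-to-worst.
print(levels(toy$HeatingQC))
medians <- tapply(toy$SalePrice, toy$HeatingQC, median)
print(medians)
```

With the factor in place, facet_wrap(~ HeatingQC) would lay the panels out in the same best-to-worst order.
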

  • Density of Sale Price by Utilities and Heating Quality

    This series of density plots shows sale price distributions by heating quality, filled by utility type. Before plotting, the rare NoSeWa records are recoded to AllPub and poor heating (Po) is merged into fair (Fa), so each panel contains a single AllPub density and there is no separate panel for poor heating. Homes with excellent or good heating show densities concentrated at higher sale prices, consistent with their higher medians ($194,700 and $152,000, against $135,000 for average and $122,750 for fair). Overall, the data points to a substantial association between heating quality and home values.

ames_housing$Utilities <- ifelse(ames_housing$Utilities == "NoSeWa", "AllPub", ames_housing$Utilities)
ames_housing$HeatingQC <- ifelse(ames_housing$HeatingQC == "Po", "Fa", ames_housing$HeatingQC)
table(ames_housing$Utilities)
## 
## AllPub 
##   1460
table(ames_housing$HeatingQC)
## 
##  Ex  Fa  Gd  TA 
## 741  50 241 428
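
The ifelse() recoding above keeps the labels intact because the columns are character vectors. If the same columns were stored as factors (for example, after reading the file with stringsAsFactors = TRUE), ifelse() would return internal integer codes for the unmatched rows. A minimal sketch of the pitfall and a factor-safe alternative follows; the vectors are made up for illustration.

```r
# Made-up example; in the report the column is ames_housing$HeatingQC.
heating_chr <- c("Ex", "Po", "TA", "Fa", "Gd")
heating_fct <- factor(heating_chr)

# On a factor, ifelse() inserts "Fa" for the matches but falls back to the
# underlying integer codes (coerced to text) for every other row.
broken <- ifelse(heating_fct == "Po", "Fa", heating_fct)
print(broken)

# Renaming the level merges "Po" into "Fa" and keeps the labels.
levels(heating_fct)[levels(heating_fct) == "Po"] <- "Fa"
print(table(heating_fct))
```

Checking the result with table(), as the report does, is a quick way to confirm that the labels survived the recode.
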

ggplot(ames_housing, aes(x = SalePrice, fill = Utilities)) +
  geom_density(alpha = 0.6) +
  facet_wrap(~ HeatingQC, scales = "free") +
  labs(title = "Density of Sale Prices by Utilities and Heating Quality",
       x = "Sale Price", y = "Density") +
  scale_fill_viridis_d(option = "plasma") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.position = "bottom",
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

  • Interaction of Utilities and Heating Quality on Sale Price of House

    This plot shows sale prices against utility type, colored by heating quality. Homes without sewer or water (NoSeWa) are rare and were recoded to AllPub above, so all points fall in a single AllPub column that spans a wide range of sale prices across heating ratings from excellent to fair. With only one utility category, the fitted lines carry no slope information, and an interaction between utilities and heating quality cannot be assessed from this plot alone; the grouped summaries that follow are more informative.

ggplot(ames_housing, aes(x = Utilities, y = SalePrice, color = HeatingQC)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, fill = "lightblue", alpha = 0.3, aes(group = HeatingQC)) +
  labs(title = "Interaction of Utilities and Heating Quality on Sale Prices",
       x = "Utilities", y = "Sale Price") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
## `geom_smooth()` using formula = 'y ~ x'

stats_summary <- ames_housing %>%
  group_by(Utilities, HeatingQC) %>%
  summarise(
    Count = n(),
    Mean = mean(SalePrice, na.rm = TRUE),
    Median = median(SalePrice, na.rm = TRUE),
    SD = sd(SalePrice, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'Utilities'. You can override using the
## `.groups` argument.
print(stats_summary)
## # A tibble: 4 × 6
## # Groups:   Utilities [1]
##   Utilities HeatingQC Count    Mean Median     SD
##   <chr>     <chr>     <int>   <dbl>  <dbl>  <dbl>
## 1 AllPub    Ex          741 214914. 194700 87470.
## 2 AllPub    Fa           50 123181. 122750 50064.
## 3 AllPub    Gd          241 156859. 152000 52924.
## 4 AllPub    TA          428 142363. 135000 47226.
correlations <- ames_housing %>%
  group_by(Utilities, HeatingQC) %>%
  summarise(Correlation = cor(SalePrice, LotArea, use = "complete.obs"), .groups = 'drop')
print(correlations)
## # A tibble: 4 × 3
##   Utilities HeatingQC Correlation
##   <chr>     <chr>           <dbl>
## 1 AllPub    Ex              0.256
## 2 AllPub    Fa              0.566
## 3 AllPub    Gd              0.334
## 4 AllPub    TA              0.453
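
The group means and medians above differ, but the summary alone does not say whether the differences exceed what chance would produce. A minimal sketch of one way to check this is a Kruskal-Wallis test with pairwise follow-ups. This is an illustrative addition rather than part of the original analysis; the data are simulated, with group means loosely based on the summary table and an assumed common standard deviation.

```r
set.seed(1)
# Simulated prices; group means loosely follow the summary table.
group_means <- c(Ex = 215000, Gd = 157000, TA = 142000, Fa = 123000)
toy <- data.frame(
  HeatingQC = rep(names(group_means), each = 60),
  SalePrice = rnorm(240, mean = rep(group_means, each = 60), sd = 50000)
)

# Rank-based test: does the price distribution differ across heating grades?
kw <- kruskal.test(SalePrice ~ HeatingQC, data = toy)
print(kw)

# Pairwise follow-up with Holm correction for multiple comparisons.
pw <- pairwise.wilcox.test(toy$SalePrice, toy$HeatingQC, p.adjust.method = "holm")
print(pw$p.value)
```

A rank-based test is used here because sale prices are right-skewed; a one-way ANOVA on log prices would be a reasonable alternative.
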
  • Comparison of Sale Prices: High vs Low Heating Quality

    This density plot is intended to contrast sale price distributions for homes with high (excellent) and low (poor) heating quality. The green curve shows the distribution for homes with excellent heating. The red curve is absent because poor ratings were merged into fair in the earlier recoding, so the filter on "Po" returns no rows. The group summaries above still point to a substantial gap: homes with excellent heating have a median sale price of $194,700, against $122,750 for the combined fair and poor group.

high_efficiency <- ames_housing %>% filter(HeatingQC == "Ex")
# Note: "Po" was merged into "Fa" in the recoding above, so this filter returns zero rows.
low_efficiency <- ames_housing %>% filter(HeatingQC == "Po")

ggplot() +
  geom_density(data = high_efficiency, aes(x = SalePrice, fill = "High"), alpha = 0.5) +
  geom_density(data = low_efficiency, aes(x = SalePrice, fill = "Low"), alpha = 0.5) +
  labs(title = "Comparison of Sale Prices: High vs Low Heating Quality",
       x = "Sale Price", y = "Density",
       fill = "Heating Quality") +
  scale_fill_manual(values = c("High" = "green", "Low" = "red")) +
  theme_minimal() +
  theme(
    legend.position = "top",
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
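
Because the earlier recoding folds "Po" into "Fa", a filter on "Po" alone can come back empty and silently drop one of the two curves. A minimal sketch, on made-up data, of defining the low-quality group so that it works before or after the recode, and of checking group sizes before plotting:

```r
# Made-up rows; HeatingQC has already had "Po" merged into "Fa".
toy <- data.frame(
  HeatingQC = c("Ex", "Ex", "Fa", "Fa", "Gd", "TA"),
  SalePrice = c(220000, 195000, 118000, 127000, 152000, 135000)
)

# Filtering on the retired "Po" label returns zero rows without any warning.
low_po <- toy[toy$HeatingQC == "Po", ]
print(nrow(low_po))

# Matching both labels works whether or not the recode has been applied.
low_efficiency  <- toy[toy$HeatingQC %in% c("Fa", "Po"), ]
high_efficiency <- toy[toy$HeatingQC == "Ex", ]

# Fail early if either group is empty, before any plotting.
stopifnot(nrow(high_efficiency) > 0, nrow(low_efficiency) > 0)
group_medians <- c(high = median(high_efficiency$SalePrice),
                   low  = median(low_efficiency$SalePrice))
print(group_medians)
```
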

  1. What is the impact of landscape and outdoor features on the sale price of a house?

    The analysis indicates that landscape and outdoor features, pools in particular, are associated with higher sale prices. Homes with pools have a higher median sale price and a broader, more right-skewed price distribution than homes without, which suggests that factors such as pool size, style, and maintenance also influence price. Among pool-equipped homes, lot area correlates positively with sale price, so larger lots appear to add further value. These comparisons should be read with some caution: most homes have no pool, and across all homes the linear correlation between PoolArea and SalePrice is only about 0.02.

  • Median Sale Price of House by Pool

    This bar chart compares median sale prices of homes with and without pools, showing higher prices for pool-equipped homes, emphasizing their significant role in enhancing property value. The broader range of sale prices among homes with pools suggests factors like size, style, and maintenance influence prices. Overall, the data highlights pools as desirable features significantly impacting home valuation in the real estate market.
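
The chunk behind this bar chart is not shown. A minimal sketch of how such a comparison could be computed is given below, assuming that PoolArea greater than zero marks a home with a pool. HasPool is a hypothetical helper column and the rows are made up for illustration.

```r
# Made-up rows; PoolArea is the pool size (0 = no pool).
toy <- data.frame(
  PoolArea  = c(0, 0, 0, 0, 512, 0, 648, 0),
  SalePrice = c(140000, 181500, 129000, 208500, 274000, 163000, 235000, 155000)
)

# HasPool is a hypothetical helper column, not part of the original data.
toy$HasPool <- ifelse(toy$PoolArea > 0, "Pool", "No Pool")

# Median price and group size side by side.
medians <- tapply(toy$SalePrice, toy$HasPool, median)
counts  <- table(toy$HasPool)
pool_summary <- data.frame(
  HasPool = names(medians),
  Median  = as.numeric(medians),
  Count   = as.integer(counts[names(medians)])
)
print(pool_summary)
```

Reporting the count next to each median makes it clear how few homes the pool group contains, which matters when interpreting the gap.
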

str(ameshous_test_data)
## 'data.frame':    1459 obs. of  80 variables:
##  $ Id           : int  1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
##  $ MSSubClass   : int  20 20 60 60 120 60 20 60 20 20 ...
##  $ MSZoning     : chr  "RH" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  80 81 74 78 43 75 NA 63 85 70 ...
##  $ LotArea      : int  11622 14267 13830 9978 5005 10000 7980 8402 10176 8400 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "IR1" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "Corner" "Inside" "Inside" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "NAmes" "NAmes" "Gilbert" "Gilbert" ...
##  $ Condition1   : chr  "Feedr" "Norm" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "1Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  5 6 5 6 8 6 6 6 7 4 ...
##  $ OverallCond  : int  6 6 5 6 5 5 7 5 5 5 ...
##  $ YearBuilt    : int  1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
##  $ YearRemodAdd : int  1961 1958 1998 1998 1992 1994 2007 1998 1990 1970 ...
##  $ RoofStyle    : chr  "Gable" "Hip" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
##  $ Exterior2nd  : chr  "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
##  $ MasVnrType   : chr  "None" "BrkFace" "None" "BrkFace" ...
##  $ MasVnrArea   : int  0 108 0 20 0 0 0 0 0 0 ...
##  $ ExterQual    : chr  "TA" "TA" "TA" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "CBlock" "CBlock" "PConc" "PConc" ...
##  $ BsmtQual     : chr  "TA" "TA" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "TA" ...
##  $ BsmtExposure : chr  "No" "No" "No" "No" ...
##  $ BsmtFinType1 : chr  "Rec" "ALQ" "GLQ" "GLQ" ...
##  $ BsmtFinSF1   : int  468 923 791 602 263 0 935 0 637 804 ...
##  $ BsmtFinType2 : chr  "LwQ" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  144 0 0 0 0 0 0 0 0 78 ...
##  $ BsmtUnfSF    : int  270 406 137 324 1017 763 233 789 663 0 ...
##  $ TotalBsmtSF  : int  882 1329 928 926 1280 763 1168 789 1300 882 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "TA" "TA" "Gd" "Ex" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  896 1329 928 926 1280 763 1187 789 1341 882 ...
##  $ X2ndFlrSF    : int  0 0 701 678 0 892 0 676 0 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  896 1329 1629 1604 1280 1655 1187 1465 1341 882 ...
##  $ BsmtFullBath : int  0 0 0 0 0 0 1 0 1 1 ...
##  $ BsmtHalfBath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  1 1 2 2 2 2 2 2 1 1 ...
##  $ HalfBath     : int  0 1 1 1 0 1 0 1 1 0 ...
##  $ BedroomAbvGr : int  2 3 3 3 2 3 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ KitchenQual  : chr  "TA" "Gd" "TA" "Gd" ...
##  $ TotRmsAbvGrd : int  5 6 6 7 5 7 6 7 5 4 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 0 1 1 0 1 0 1 1 0 ...
##  $ FireplaceQu  : chr  NA NA "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ GarageYrBlt  : int  1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
##  $ GarageFinish : chr  "Unf" "Unf" "Fin" "Fin" ...
##  $ GarageCars   : int  1 1 2 2 2 2 2 2 2 2 ...
##  $ GarageArea   : int  730 312 482 470 506 440 420 393 506 525 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  140 393 212 360 0 157 483 0 192 240 ...
##  $ OpenPorchSF  : int  0 36 34 36 82 84 21 75 0 0 ...
##  $ EnclosedPorch: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ScreenPorch  : int  120 0 0 0 144 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  "MnPrv" NA "MnPrv" NA ...
##  $ MiscFeature  : chr  NA "Gar2" NA NA ...
##  $ MiscVal      : int  0 12500 0 0 0 0 500 0 0 0 ...
##  $ MoSold       : int  6 6 3 6 1 4 3 5 2 4 ...
##  $ YrSold       : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Normal" ...
table(ames_housing$PoolQC)
## 
##   Ex   Fa   Gd None 
##    2    2    3 1453
ames_housing$HasPool <- factor(ifelse(ames_housing$PoolQC %in% c("Ex", "Gd", "TA", "Fa"), "Yes", "No"),
                               levels = c("No", "Yes"),
                               labels = c("No Pool", "Has Pool"))

# Plot
ggplot(ames_housing, aes(x = factor(HasPool), y = SalePrice, fill = factor(HasPool))) +
  stat_summary(fun = median, geom = "bar", position = position_dodge(width = 0.8), width = 0.6) +
  stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.8), width = 0.2) +
  labs(title = "Median Sale Prices by Pool Presence",
       x = "Has Pool", y = "Median Sale Price") +
  scale_fill_manual(values = c("No Pool" = "lightblue", "Has Pool" = "orange")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

  • Density of Sale Price by Pool Availability

    This density plot compares sale price distributions by pool presence, with one panel per group. The dashed vertical lines mark the median (black) and the first and third quartiles (red and blue) of sale price; they are computed over all homes, so they sit at the same positions in both panels. Homes without a pool show a tighter distribution with a lower median, while homes with a pool show a broader, right-skewed distribution with a tail of high-value transactions. Because only 7 of the 1,460 homes have a pool (see the PoolQC table above), the pool curve is estimated from very few sales, and the apparent price premium should be treated as tentative.

ggplot(ames_housing, aes(x = SalePrice, fill = factor(HasPool))) +
  geom_density(alpha = 0.5) +
  geom_vline(aes(xintercept = median(SalePrice)), color = "black", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = quantile(SalePrice, 0.25)), color = "red", linetype = "dashed", linewidth = 0.8) +
  geom_vline(aes(xintercept = quantile(SalePrice, 0.75)), color = "blue", linetype = "dashed", linewidth = 0.8) +
  facet_wrap(~ HasPool) +
  labs(title = "Density of Sale Prices by Pool Presence",
       x = "Sale Price", y = "Density", fill = "Has Pool") +
  scale_fill_manual(values = c("No Pool" = "lightblue", "Has Pool" = "orange")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
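
    Since the pool group is so small, it helps to print group sizes next to any per-group summary. The following is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not taken from the Ames data); the same pattern would apply to ames_housing with its HasPool and SalePrice columns.

```r
# Hypothetical toy data standing in for ames_housing (values are made up).
toy <- data.frame(
  HasPool   = factor(c("No Pool", "No Pool", "No Pool", "No Pool", "Has Pool", "Has Pool"),
                     levels = c("No Pool", "Has Pool")),
  SalePrice = c(120000, 150000, 180000, 210000, 200000, 300000)
)

# Per-group size, median and interquartile range of sale price.
pool_summary <- do.call(rbind, lapply(split(toy$SalePrice, toy$HasPool), function(p) {
  data.frame(n = length(p), median = median(p), iqr = IQR(p))
}))
print(pool_summary)
```

    Reporting n alongside the median makes it obvious when a group summary rests on only a few sales.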

  • Sale Price by Pool Availability

    This plot compares sale prices for homes with and without pools using boxplots overlaid with the individual data points. Homes without pools (left) cluster around a lower median with a tight interquartile range and many points extending into higher price ranges. Homes with pools (right) show a slightly higher median and a broader interquartile range, based on only a handful of sales. The red diamonds mark the group medians, and the blue error bars span the first to third quartiles. The higher median for pool-equipped homes is consistent with pools being associated with higher sale prices, although the small number of pool homes limits how far this can be generalized.

ggplot(ames_housing, aes(x = factor(HasPool), y = SalePrice, fill = factor(HasPool))) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.5, color = "black") +
  stat_summary(fun = median, geom = "point", shape = 18, size = 4, color = "red") +
  stat_summary(fun.data = function(x) {
    # One row per group: median as the center, quartiles as the bar ends
    q <- quantile(x, c(0.25, 0.75), names = FALSE)
    data.frame(y = median(x), ymin = q[1], ymax = q[2])
  }, geom = "errorbar", width = 0.2, color = "blue") +
  labs(title = "Sale Prices by Pool Presence",
       x = "Has Pool", y = "Sale Price") +
  scale_fill_manual(values = c("No Pool" = "lightblue", "Has Pool" = "orange")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

  • Sale Price vs Lot Area by Pool Presence

    This scatter plot depicts the relationship between lot area and sale price for all homes, with points colored by pool presence. The blue trend line is a single linear fit across all homes and suggests a positive association: as lot area increases, so does sale price. Clustering at lower lot sizes with a wide price range shows that lot area alone explains only part of the variation in price. Outliers at very large lot sizes suggest that other factors such as location and amenities also influence sale prices.

ggplot(ames_housing, aes(x = LotArea, y = SalePrice, color = factor(HasPool))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, aes(group = 1), color = "blue") +  
  labs(title = "Sale Price vs. Lot Area by Pool Presence",
       x = "Lot Area (sq feet)", y = "Sale Price",
       color = "Has Pool") +  # Add legend title
  scale_color_manual(values = c("No Pool" = "red", "Has Pool" = "green")) +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "bottom"  
  )
## `geom_smooth()` using formula = 'y ~ x'
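
    The blue line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit of sale price on lot area. As a rough sketch, the slope behind such a line can be read off with lm(); the toy data frame below is hypothetical and deliberately lies on a straight line so the result is easy to check by hand.

```r
# Hypothetical toy data: price rises by exactly $10 per extra square foot.
toy <- data.frame(
  LotArea   = c(5000, 7000, 9000, 11000, 13000),
  SalePrice = c(110000, 130000, 150000, 170000, 190000)
)

# Same model that geom_smooth(method = "lm") fits with formula y ~ x.
fit <- lm(SalePrice ~ LotArea, data = toy)
slope     <- unname(coef(fit)["LotArea"])       # dollars per square foot
intercept <- unname(coef(fit)["(Intercept)"])   # fitted price at zero lot area
print(c(intercept = intercept, slope = slope))
```

    On the real data the slope would be expressed in dollars per additional square foot of lot, which is often easier to communicate than the line itself.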

  1. How do neighborhood amenities affect the sale price of a house?

    Comprehensive data analysis reveals the crucial role of neighborhood amenities in determining house sale prices. Variation in median sale prices across the top 10 neighborhoods, with areas like NridgHt, NoRidge, and StoneBr consistently commanding higher prices, underscores their desirability and likely higher affluence. Sale price distributions within neighborhoods highlight the complexity of real estate pricing, influenced by factors like location and property features. These insights emphasize the importance of considering neighborhood amenities when assessing sale prices, as they significantly impact market dynamics and buyer perceptions.

  • Median Sale Price among the Top 10 Neighborhoods

    This boxplot shows sale price distributions for the 10 neighborhoods with the highest median sale price, ordered by median. Neighborhoods like NridgHt, NoRidge, and StoneBr show higher median prices, indicating greater affluence or desirability. Box lengths represent the interquartile range within each neighborhood, with wider boxes such as Veenker suggesting a more diverse housing stock. Points far from the boxes are sales that deviate strongly from typical prices, possibly due to unique features or conditions. Overall, the plot underscores the strong association between neighborhood and home price.

my_colors <- RColorBrewer::brewer.pal(10, "Set3")  
if (length(my_colors) < 10) {
  my_colors <- colorRampPalette(my_colors)(10)  
}

top_neighborhoods <- ames_housing %>%
  group_by(Neighborhood) %>%
  summarize(MedianSalePrice = median(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
  top_n(10, MedianSalePrice) %>%
  arrange(desc(MedianSalePrice)) %>%
  pull(Neighborhood)

ames_housing_top <- ames_housing %>%
  filter(Neighborhood %in% top_neighborhoods)

ggplot(ames_housing_top, aes(x = reorder(Neighborhood, SalePrice, FUN = median), y = SalePrice, fill = Neighborhood)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +  
  geom_jitter(width = 0.2, alpha = 0.5, color = "black") +  
  stat_summary(fun = median, geom = "point", shape = 18, size = 4, color = "red") +  
  stat_summary(fun.data = function(x) {
    # One row per group: median as the center, quartiles as the bar ends
    q <- quantile(x, c(0.25, 0.75), names = FALSE)
    data.frame(y = median(x), ymin = q[1], ymax = q[2])
  }, geom = "errorbar", width = 0.2, color = "blue") +
  scale_fill_manual(values = my_colors) +
  labs(title = "Median Sale Prices Across Top 10 Neighborhoods",
       x = "Neighborhood", y = "Sale Price") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
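
    The ranking step that selects the top neighborhoods can be checked independently of the plot. Below is a minimal base R sketch on hypothetical data (the neighborhood labels A to D and their prices are made up). In dplyr, top_n() has been superseded by slice_max(), and both keep ties by default, so a tie at the cutoff can return more rows than requested; the base R version here always returns exactly the requested number.

```r
# Hypothetical toy data: four neighborhoods with two sales each.
toy <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "C", "C", "D", "D"),
  SalePrice    = c(100, 120, 300, 320, 200, 220, 150, 170) * 1000
)

# Median sale price per neighborhood, then the names of the two highest.
med  <- tapply(toy$SalePrice, toy$Neighborhood, median)
top2 <- names(sort(med, decreasing = TRUE))[1:2]
print(med)
print(top2)
```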

  • Density of Sale Price in the Top 10 Neighborhoods

    This density plot shows how sale prices are distributed in the ten neighborhoods, each drawn in a different color. Neighborhoods such as StoneBr and NridgHt peak at higher prices, which suggests greater affluence, whereas Blmngtn and CollgCr peak at lower prices. The red dashed line marks the median sale price across all ten neighborhoods, and the blue dotted curve is a normal density with the pooled mean and standard deviation, shown for reference. The broader curves for NridgHt and StoneBr point to a wider mix of house types and price levels within those neighborhoods.

ggplot(ames_housing_top, aes(x = SalePrice, fill = Neighborhood)) +
  geom_density(alpha = 0.6, color = "black") +  
  geom_vline(aes(xintercept = median(SalePrice)), color = "red", linetype = "dashed", linewidth = 1) +
  stat_function(
    fun = dnorm,
    args = list(mean = mean(ames_housing_top$SalePrice), sd = sd(ames_housing_top$SalePrice)),
    aes(x = SalePrice),  # explicitly define x
    inherit.aes = FALSE,  # prevent it from using 'fill = Neighborhood'
    color = "blue",
    linetype = "dotted"
  ) +
  scale_fill_manual(values = my_colors) +
  labs(title = "Density of Sale Prices Across Top 10 Neighborhoods",
       x = "Sale Price", y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

  • Sale Prices Across Top 10 Neighborhoods Over Time

    This plot shows individual sale prices in the top 10 neighborhoods from 2006 to 2010 as jittered points with boxplots overlaid. Color indicates the year of sale, so shifts in pricing between years can be compared within each neighborhood. Areas like NridgHt, NoRidge, and StoneBr consistently have high-priced properties, which reflects how much buyers are willing to pay there, likely because of desirability and affluence. The spread from high to low prices within each neighborhood reflects differences in property value that may stem from lot size or home design, including living area. Together these patterns point to a complex picture in which several factors jointly shape prices.

ggplot(ames_housing_top, aes(x = reorder(Neighborhood, SalePrice, FUN = median), y = SalePrice, color = as.factor(YrSold))) +
  geom_jitter(alpha = 0.6, width = 0.3) +
  geom_boxplot(alpha = 0, outlier.shape = NA, width = 0.2) +  
  labs(title = "Sale Prices Across Top 10 Neighborhoods Over Time",
       x = "Neighborhood", y = "Sale Price",
       color = "Year Sold") +
  scale_color_discrete(name = "Year Sold") +  
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    axis.text.y = element_text(size = 12)
  )

  • Time Series of Median Sale Price in the Top 10 Neighborhoods

    This line chart tracks the median sale price in each of the top 10 neighborhoods by year of sale from 2006 to 2010, with one colored line per neighborhood. Certain neighborhoods stay above the others throughout, indicating desirability or affluence, while year-to-year movements within a neighborhood show how its typical price shifted over the five years. Such shifts may be linked to broader economic factors or local developments, although medians based on few sales per year can move sharply for reasons unrelated to the market.

neighborhood_yearly_top <- ames_housing_top %>%
  group_by(Neighborhood, YrSold) %>%
  summarize(MedianSalePrice = median(SalePrice, na.rm = TRUE), .groups = 'drop')

ggplot(neighborhood_yearly_top, aes(x = factor(YrSold), y = MedianSalePrice, group = Neighborhood, color = Neighborhood)) +
  geom_line() +
  scale_color_brewer(type = "qual", palette = "Paired") +  
  labs(title = "Time Series of Median Sale Prices by Top 10 Neighborhoods",
       x = "Year Sold", y = "Median Sale Price",
       color = "Neighborhood") +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    axis.text.y = element_text(size = 12)
  )
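
    Year-to-year movement in these medians is easier to compare across neighborhoods as a percentage change. The sketch below uses a hypothetical table with the same column names as neighborhood_yearly_top (the values and the neighborhoods A and B are made up) and adds a PctChange column, which is an illustrative name and not part of the original analysis.

```r
# Hypothetical yearly medians for two made-up neighborhoods.
toy <- data.frame(
  Neighborhood    = rep(c("A", "B"), each = 3),
  YrSold          = rep(2006:2008, times = 2),
  MedianSalePrice = c(200000, 210000, 189000, 300000, 285000, 313500)
)

# Sort so that diff() compares consecutive years within each neighborhood.
toy <- toy[order(toy$Neighborhood, toy$YrSold), ]

# Percentage change versus the previous year; the first year has no predecessor.
toy$PctChange <- ave(toy$MedianSalePrice, toy$Neighborhood,
                     FUN = function(p) c(NA, 100 * diff(p) / head(p, -1)))
print(toy)
```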

  • 2D Density Map of Sale Prices and Lot Area by Top 10 Neighborhoods

    This 2D density map showcases sale prices relative to lot area across the top 10 neighborhoods. Denser colors indicate higher sales concentration at specific price points and lot sizes. The plot highlights two main concentrations: one at lower prices and smaller lots, and another, less dense area at higher prices and larger lots. This reflects typical property characteristics within neighborhoods, with smaller, more affordable homes dominating, while a smaller segment features larger, pricier properties. The visualization aids in understanding real estate trends and informs property investment decisions based on lot size and expected sale price ranges.

ggplot(ames_housing_top, aes(x = LotArea, y = SalePrice, color = Neighborhood)) +
  geom_point(alpha = 0.6) +
  # after_stat(level) is the current form of the dot-dot notation (..level..), deprecated in ggplot2 3.4.0
  geom_density_2d_filled(contour_var = "ndensity", aes(fill = after_stat(level))) +
  scale_color_manual(values = my_colors) +
  scale_fill_manual(values = my_colors) +
  labs(title = "2D Density Map of Sale Prices and Lot Area by Top 10 Neighborhoods",
       x = "Lot Area", y = "Sale Price") +
  theme_minimal()

  • Map of the Top 10 Neighborhoods by Median Sale Price

    The R code generates an interactive Leaflet map showing median sale prices in selected Ames, Iowa neighborhoods. Markers are color-coded: red for neighborhoods with a median sale price above $200,000 and blue for the rest. A polyline connects the three highest-priced neighborhoods, and a polygon outlines the area they span. Zoom controls and a layer control support interactive exploration of the map.

median_prices <- ames_housing %>%
  group_by(Neighborhood) %>%
  summarize(MedianSalePrice = median(SalePrice, na.rm = TRUE), .groups = 'drop')

neighborhoods_from_plot <- c("Blmngtn", "ClearCr", "CollgCr", "Crawfor", "NoRidge", 
                             "NridgHt", "Somrst", "StoneBr", "Timber", "Veenker")

filtered_data <- median_prices %>%
  filter(Neighborhood %in% neighborhoods_from_plot)

# Manually entered coordinates for the Ames neighborhoods
neighborhood_coords <- data.frame(
  Neighborhood = neighborhoods_from_plot,
  Latitude = c(42.05905, 41.6668, 42.02109528, 42.020579, 42.05055618, 
               42.05963516, 41.6449, 42.06128, 41.72098, 42.02369),
  Longitude = c(-93.63793, -93.6668, -93.68562317, -95.3811884, -93.62717438, 
                -93.65499878, -91.48731, -93.63313, -91.47446, -93.64669)
)

full_data <- merge(neighborhood_coords, filtered_data, by = "Neighborhood")

pal <- colorNumeric(palette = "viridis", domain = full_data$MedianSalePrice)
final_map <- leaflet(full_data) %>%
  addTiles() %>%  
  setView(lng = -93.6250, lat = 42.0308, zoom = 12)

final_map <- final_map %>%
  addAwesomeMarkers(
    ~Longitude, ~Latitude,
    icon = makeAwesomeIcon(
      icon = 'home', 
      markerColor = ~ifelse(MedianSalePrice > 200000, 'red', 'blue')
    ),
    popup = ~paste(Neighborhood, "<br> Median Sale Price: $", format(MedianSalePrice, big.mark=",", scientific=FALSE)),
    group = "Price Markers"  # must match the overlay group name in addLayersControl()
  )

top_neighborhoods <- full_data %>%
  top_n(3, MedianSalePrice) %>%
  arrange(desc(MedianSalePrice))
final_map <- final_map %>%
  addPolylines(
    lng = top_neighborhoods$Longitude,
    lat = top_neighborhoods$Latitude,
    color = "red",
    weight = 5,
    opacity = 0.7,
    group = "Top Priced Route"  # must match the overlay group name in addLayersControl()
  )

final_map <- final_map %>%
  addPolygons(
    lng = c(min(top_neighborhoods$Longitude) - 0.01, max(top_neighborhoods$Longitude) + 0.01, 
            max(top_neighborhoods$Longitude) + 0.01, min(top_neighborhoods$Longitude) - 0.01),
    lat = c(min(top_neighborhoods$Latitude) - 0.01, min(top_neighborhoods$Latitude) - 0.01,
            max(top_neighborhoods$Latitude) + 0.01, max(top_neighborhoods$Latitude) + 0.01),
    fillColor = "#ff7800",
    fillOpacity = 0.5,
    weight = 3,
    color = "orange",
    opacity = 0.8
  )

final_map <- final_map %>%
  addLayersControl(
    overlayGroups = c("Price Markers", "Top Priced Route"),
    options = layersControlOptions(collapsed = FALSE)
  )

final_map
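
    Hand-entered coordinates are easy to mistype, and several of the values in neighborhood_coords differ from the setView() center by more than a degree of latitude or longitude. A quick distance check can flag such entries before they are plotted. The sketch below is one possible check, not part of the original analysis: haversine_km is a helper written here for illustration, the 25 km threshold is an arbitrary assumption, and only two of the coordinate pairs from the code above are copied in.

```r
# Great-circle distance in km between points and a reference point (haversine formula).
haversine_km <- function(lat, lon, lat0, lon0, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat - lat0) * to_rad
  dlon <- (lon - lon0) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat0 * to_rad) * cos(lat * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Map center used in setView() and two of the hand-entered coordinate pairs.
center <- list(lat = 42.0308, lon = -93.6250)
coords <- data.frame(
  Neighborhood = c("Blmngtn", "Crawfor"),
  Latitude     = c(42.05905, 42.020579),
  Longitude    = c(-93.63793, -95.3811884)
)

coords$DistKm  <- haversine_km(coords$Latitude, coords$Longitude, center$lat, center$lon)
coords$Suspect <- coords$DistKm > 25   # assumed threshold; adjust to the size of the study area
print(coords)
```

    Rows flagged as suspect would need their coordinates verified against a reliable source before the map is interpreted.
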
  1. How do market dynamics influence the sale price of a house?

    Market dynamics influence house sale prices, as the analysis of sales from 2006 to 2010 shows. Fluctuations in average sale prices, depicted by line graphs and a heatmap, illustrate the influence of broader economic conditions and seasonal factors on the real estate market. Sharp declines in prices, such as those in 2007, followed by recoveries and further fluctuations, suggest market volatility possibly influenced by events like the global financial crisis. Peaks and troughs in monthly sale prices indicate seasonal trends or variations in market activity throughout the year. The overall trend shows a decline in property values from 2008 onwards, likely attributable to the economic downturn. Understanding these dynamics is crucial for informed decisions regarding property investments and market strategies.

  • Average Sale Price Over time by Year

    This line graph shows monthly average home sale prices from 2006 to 2010, with each year drawn as a separately colored line. Fluctuations in prices, like the sharp decline in 2007 followed by a recovery in 2008, hint at market volatility possibly linked to economic events such as the global financial crisis. The subsequent ups and downs in 2009 and 2010 indicate continued market instability or other economic pressures influencing home values. Such insights are vital for investors and policymakers navigating real estate markets.

if (!"YrSold" %in% names(ames_housing) | !"MoSold" %in% names(ames_housing)) {
  stop("YrSold and/or MoSold columns are missing")
}

ames_housing$DateSold <- as.Date(paste(ames_housing$YrSold, ames_housing$MoSold, "01", sep = "-"), format = "%Y-%m-%d")

ames_housing$Year <- factor(ames_housing$YrSold)

daily_avg_prices <- ames_housing %>%
  group_by(DateSold, Year) %>%
  summarize(AveragePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop')

color_palette <- colorRampPalette(brewer.pal(9, "Set1"))(length(unique(ames_housing$Year)))

ggplot(daily_avg_prices, aes(x = DateSold, y = AveragePrice, group = Year, color = Year)) +
  geom_line(linewidth = 1.5, alpha = 0.8) +
  scale_color_manual(values = color_palette) +  
  labs(title = "Average Sale Prices Over Time by Year",
       x = "Date Sold", y = "Average Sale Price") +
  theme_minimal() +
  theme(
    legend.position = "bottom",  
    legend.title = element_text(size = 14, face = "bold"),  
    legend.text = element_text(size = 12),  
    axis.text = element_text(size = 12),  
    axis.title = element_text(size = 14)  
  )

  • Monthly Average Sales Price

    This line graph tracks monthly average home sale prices from 2006 to 2010, with a solid blue line depicting month-to-month fluctuations and a dashed red line showing the overall trend. Peaks and troughs in the blue line hint at seasonal trends or market volatility, while the downward trend in the red line post-2008 may reflect broader economic challenges impacting property values. This visualization provides insight into how external economic conditions and seasonal factors influence real estate dynamics.

ames_housing$MonthYear <- as.Date(paste(ames_housing$YrSold, ames_housing$MoSold, "01", sep = "-"), "%Y-%m-%d")

monthly_prices <- ames_housing %>%
  group_by(MonthYear) %>%
  summarize(AveragePrice = mean(SalePrice, na.rm = TRUE))

ggplot(monthly_prices, aes(x = MonthYear, y = AveragePrice)) +
  geom_line(color = "dodgerblue", linewidth = 1.2) +
  geom_smooth(method = "loess", se = FALSE, color = "red", linetype = "dashed") +  
  labs(title = "Monthly Average Sale Prices",
       x = "Month-Year", y = "Average Sale Price") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title = element_text(size = 14), 
    axis.text = element_text(size = 12),  
    legend.position = "none"  
  )
## `geom_smooth()` using formula = 'y ~ x'
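
    The dashed red line is a loess smooth, whose shape depends on a span parameter left at its default here. A centered moving average is a more transparent alternative for reading the underlying trend. The sketch below is an illustration under assumed data: the monthly table and its values are made up, and Rolling3 is a hypothetical column name.

```r
# Hypothetical monthly averages (values are made up).
monthly <- data.frame(
  MonthYear    = seq(as.Date("2006-01-01"), by = "month", length.out = 6),
  AveragePrice = c(180000, 186000, 174000, 192000, 171000, 189000)
)

# Centered 3-month moving average; the first and last months have no full window.
monthly$Rolling3 <- as.numeric(
  stats::filter(monthly$AveragePrice, rep(1 / 3, 3), sides = 2)
)
print(monthly)
```

    A wider window smooths more aggressively but leaves more missing values at both ends of the series.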

  • Heatmap of Average Sales Price by Month and Year

    This heatmap visualizes average sale prices from 2006 to 2010, with warmer colors indicating higher prices and cooler colors representing lower prices. Patterns of warmer hues suggest spikes in prices, possibly due to increased market activity, while cooler tones indicate downturns. This visualization offers insights into how real estate prices fluctuate over time, reflecting market dynamics and trends.

monthly_prices$Year <- year(monthly_prices$MonthYear)
monthly_prices$Month <- factor(month(monthly_prices$MonthYear, label = TRUE), levels = month.abb)

ggplot(monthly_prices, aes(x = Month, y = Year, fill = AveragePrice)) +
  geom_tile(color = "white") +  
  scale_fill_gradient(low = "lightblue", high = "darkred") +  
  labs(title = "Heatmap of Average Sale Prices by Month and Year",
       x = "Month", y = "Year") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),  
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12),  
    legend.title = element_text(size = 14),  
    legend.text = element_text(size = 12),  
    panel.grid = element_blank()  
  )
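
    The heatmap is a picture of a year-by-month table of average prices, and printing that table directly makes individual cells easy to verify. A minimal base R sketch on hypothetical sales (the toy values are made up) follows; month.abb is base R's built-in vector of English month abbreviations, used here so the column order does not depend on the session locale.

```r
# Hypothetical toy sales (values are made up).
toy <- data.frame(
  YrSold    = c(2006, 2006, 2006, 2007, 2007),
  MoSold    = c(1, 1, 2, 1, 2),
  SalePrice = c(100000, 140000, 150000, 130000, 170000)
)

# Month as a factor with all twelve levels, so empty months still get a column.
toy$Month <- factor(month.abb[toy$MoSold], levels = month.abb)

# Year x Month matrix of mean sale prices; cells without sales are NA.
price_grid <- tapply(toy$SalePrice, list(Year = toy$YrSold, Month = toy$Month), mean)
print(price_grid[, 1:3])
```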

  1. How do seasonal trends affect Sale Price of House in Ames, Iowa?

    Seasonal effects on sale price in Ames, Iowa, appear modest in this data. Seasons are defined from the month of sale in three-month blocks: January to March as Winter, April to June as Spring, July to September as Summer, and October to December as Fall. The boxplot, density plot, and violin plot compare the sale price distribution across the four seasons, and the summary statistics quantify the differences: median sale prices range from $156,750 in Spring to $171,000 in Summer, a gap of about $14,000, while the standard deviation within each season lies between roughly $71,000 and $91,000. The bar chart of median sale prices by neighborhood and season shows variation both across neighborhoods and across seasons within a neighborhood, although each of those medians can rest on a small number of sales. Taken together, the plots suggest that season is a secondary influence on price next to factors such as quality and location, which is still useful context for home buyers, sellers, and real estate professionals in the Ames market.

  • Seasonal Trends in Sales Price

    This boxplot compares sale prices across seasons: winter (green), spring (orange), summer (blue), and fall (pink). According to the summary statistics later in this section, summer has the highest median ($171,000) and spring the lowest ($156,750), with winter ($165,250) and fall ($167,500) in between. The range of prices within each season reflects factors like home features and neighborhood desirability, and it is large relative to the differences between seasons.

ames_housing$Season <- factor(
  cut(ames_housing$MoSold, breaks = c(0, 3, 6, 9, 12), labels = c("Winter", "Spring", "Summer", "Fall")),
  levels = c("Winter", "Spring", "Summer", "Fall")
)

ggplot(ames_housing, aes(x = Season, y = SalePrice, fill = Season)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.8) +  
  geom_jitter(width = 0.2, size = 2, alpha = 0.5, color = "black") +  
  scale_fill_brewer(palette = "Set2") +  
  labs(title = "Seasonal Trends in Sale Prices", x = "Season", y = "Sale Price") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),  
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12),  
    legend.position = "none"  
  )
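
    How cut() assigns months to seasons depends on the break points and on which side of each interval is closed, so it is worth checking the mapping on the twelve month numbers before using it. The sketch below reproduces the breaks used above and, for comparison, a meteorological grouping (December to February as winter); the second scheme is an assumption added for illustration and is not the one used in this analysis.

```r
months <- 1:12

# Scheme used above: right-closed intervals (0,3], (3,6], (6,9], (9,12].
season <- cut(months, breaks = c(0, 3, 6, 9, 12),
              labels = c("Winter", "Spring", "Summer", "Fall"))
print(data.frame(month = month.abb, season = season))

# Alternative (assumed, for comparison only): meteorological seasons via a lookup vector.
met_lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
met_season <- factor(met_lookup[months], levels = c("Winter", "Spring", "Summer", "Fall"))
print(table(season, met_season))
```

    Under the scheme used above, December is grouped with Fall and March with Winter, which matters when interpreting the seasonal labels.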

  • Density of Sales Price of House by Season

    The density plot overlays the sale price distribution for each season. All four distributions are right-skewed, which matches the summary statistics below, where the mean exceeds the median in every season. Spring is the most concentrated (standard deviation of about $71,000), while winter and summer have the widest spreads (about $90,700 and $85,400), consistent with more sales in the higher price ranges during those periods.

ggplot(ames_housing, aes(x = SalePrice, fill = Season)) +
  geom_density(alpha = 0.6) +
  scale_fill_brewer(palette = "Set2") +  
  labs(title = "Density of Sale Prices by Season",
       x = "Sale Price", y = "Density") +
  theme_minimal() +
  theme(
    legend.position = "top",  
    plot.title = element_text(size = 20, face = "bold"),  
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12)  
  )

  • Violin Plots of Sales Price by Season

    The violin plot shows the sale price distribution for each season, with an embedded box plot marking the median and quartiles. The bulk of each violin sits in a similar price range, with a long upper tail of higher-priced sales in every season. Winter has the largest spread and spring the smallest, as the variance and standard deviation in the summary table below confirm. The medians differ only moderately between seasons, so the seasonal pattern is visible mainly in the tails and not in the center of the distributions.

ggplot(ames_housing, aes(x = Season, y = SalePrice, fill = Season)) +
  geom_violin(trim = FALSE, alpha = 0.8) +  
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +  
  labs(title = "Violin Plots of Sale Prices by Season",
       x = "Season", y = "Sale Price") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),  
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12), 
    legend.position = "none"  
  )

seasonal_stats <- ames_housing %>%
  group_by(Season) %>%
  summarize(
    Average = mean(SalePrice, na.rm = TRUE),
    Median = median(SalePrice, na.rm = TRUE),
    Variance = var(SalePrice, na.rm = TRUE),
    SD = sd(SalePrice, na.rm = TRUE)
  )

print(seasonal_stats)
## # A tibble: 4 × 5
##   Season Average Median    Variance     SD
##   <fct>    <dbl>  <dbl>       <dbl>  <dbl>
## 1 Winter 181961. 165250 8229839609. 90718.
## 2 Spring 174271. 156750 5039986048. 70993.
## 3 Summer 187248. 171000 7285022935. 85352.
## 4 Fall   185773. 167500 5910101316. 76877.
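
    With seasonal medians this close together and standard deviations this large, a formal comparison can help decide whether the seasonal differences are distinguishable from noise. One option, which is an assumption of this sketch and not part of the original analysis, is the Kruskal-Wallis rank-sum test from base R's stats package. The toy data below is made up; on the real data the call would use ames_housing instead.

```r
# Hypothetical toy data: five made-up sale prices per season.
toy <- data.frame(
  Season    = factor(rep(c("Winter", "Spring", "Summer", "Fall"), each = 5),
                     levels = c("Winter", "Spring", "Summer", "Fall")),
  SalePrice = c(150, 160, 170, 180, 190,  140, 155, 165, 175, 200,
                145, 162, 172, 185, 195,  152, 158, 168, 178, 205) * 1000
)

# Rank-based test of whether the seasons share the same price distribution.
kw <- kruskal.test(SalePrice ~ Season, data = toy)
print(kw)
```

    A large p-value would indicate that the data give no clear evidence of seasonal differences; a rank-based test is used here because sale prices are right-skewed.
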
  • Seasonal Trends in Sale Price by Neighborhood

    The bar chart displays median sale prices across Ames neighborhoods by season, with one colored bar per season in each neighborhood. Neighborhoods like NridgHt and NoRidge consistently command higher prices, while IDOTRR and MeadowV generally have lower median prices. Seasonal medians also differ within neighborhoods, but each bar can rest on a small number of sales, so these differences should be read with caution. The season definition matches the one used earlier in this section, so every month of sale, including December, is assigned to a season.

# Same three-month blocks as earlier: months 1-3 Winter, 4-6 Spring, 7-9 Summer, 10-12 Fall.
# Breaks start at 0 so that the right-closed intervals cover all months from 1 to 12.
ames_housing$Season <- cut(ames_housing$MoSold,
                           breaks = c(0, 3, 6, 9, 12),
                           labels = c("Winter", "Spring", "Summer", "Fall"))

median_prices_by_season <- ames_housing %>%
  group_by(Neighborhood, Season) %>%
  summarize(MedianSalePrice = median(SalePrice, na.rm = TRUE), .groups = 'drop')

ggplot(median_prices_by_season, aes(x = Neighborhood, y = MedianSalePrice, fill = Season)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Seasonal Trends in House Sale Prices by Neighborhood",
       x = "Neighborhood",
       y = "Median Sale Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set1")  
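
    Splitting sales by both neighborhood and season can leave very few sales behind each bar. A count table shows which medians are well supported. The sketch below uses hypothetical data (neighborhoods A and B and their sales are made up), and the cutoff of three sales is an arbitrary assumption chosen for illustration.

```r
# Hypothetical toy sales (values are made up).
toy <- data.frame(
  Neighborhood = c("A", "A", "A", "B", "B", "B", "B"),
  Season       = factor(c("Winter", "Winter", "Summer", "Winter", "Summer", "Summer", "Summer"),
                        levels = c("Winter", "Spring", "Summer", "Fall")),
  SalePrice    = c(100, 120, 150, 200, 210, 230, 250) * 1000
)

# Number of sales behind each neighborhood-by-season cell.
counts <- table(toy$Neighborhood, toy$Season)
print(counts)

# Cells that have sales but fewer than the assumed minimum of three.
n_thin <- sum(counts > 0 & counts < 3)
print(n_thin)
```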

  1. How do quality and condition of a house impact Sale Price of Houses in Ames?

    The visual analyses reveal a consistent relationship between house quality/condition and sale prices in Ames. Both boxplot and violin plot presentations show that higher quality ratings correspond to elevated median sale prices, indicating market preference for better-quality properties. However, variability within each rating suggests other factors influence prices. The scatter plot further highlights a positive correlation between quality ratings and prices, notably in homes rated 8-10, reflecting diverse buyer preferences. While condition also impacts pricing, its effect appears complex, emphasizing the need to consider quality and condition together when assessing property worth in Ames.

  • Boxplot of Sales Price by Quality and Condition of House

    The boxplot indicates higher quality ratings correlate with increased median sale prices, evident from the upward shift in median lines across ratings 1 to 10. Variability within each rating, reflected in whisker lengths and box ranges, suggests other factors influence prices. Homes with the highest ratings show wide price ranges, indicative of diverse buyer perceptions and additional features’ influence. This visualization succinctly demonstrates quality’s impact on home value and variability within quality segments.

palette15 <- colorRampPalette(brewer.pal(9, "Set3"))(15)

ggplot(ames_housing, aes(x = as.factor(OverallQual), y = SalePrice, fill = as.factor(OverallCond))) +
  geom_boxplot() +
  scale_fill_manual(values = palette15) +
  labs(title = "Boxplot of Sale Prices by Quality and Condition",
       x = "Overall Quality", y = "Sale Price", fill = "Overall Condition") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),  
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12), 
    legend.position = "right"  
  )

  • Violin Plots of Sales Price by Overall Quality of House

    The violin plots display sale price distributions across overall quality levels, with higher levels correlating with higher median prices, notably levels 8, 9, and 10. Thicker sections indicate denser price clusters; width reflects the relative density of prices within a group, not the number of sales. Lower quality levels exhibit fewer data points and lower prices, while higher levels show broader distributions, reflecting greater price variability. This visualization captures the relationship between quality, sale prices, and price dispersion.

ggplot(ames_housing, aes(x = as.factor(OverallQual), y = SalePrice, fill = as.factor(OverallCond))) +
  geom_violin(trim = FALSE, alpha = 0.8) +  
  scale_fill_manual(values = palette15) +  
  labs(title = "Violin Plots of Sale Prices by Quality and Condition",
       x = "Overall Quality", y = "Sale Price",
       fill = "Overall Condition") +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"), 
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12),  
    legend.position = "right"  
  )
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

  • Scatter Plot of Sales Price by Overall Quality of House

    The scatter plot illustrates the relationship between sale prices, overall quality, and condition ratings of houses. It demonstrates a positive correlation between higher quality ratings and sale prices, with the highest quality homes exhibiting a wider range of prices. While condition also influences prices, its impact is less pronounced. This visualization emphasizes the significance of quality in determining real estate values.

ggplot(ames_housing, aes(x = as.factor(OverallQual), y = SalePrice, color = as.factor(OverallCond))) +
  geom_jitter(alpha = 0.6, shape = 16, width = 0.2) +  
  scale_color_manual(values = palette15) +  
  labs(title = "Scatter Plot of Sale Prices by Quality and Condition",
       x = "Overall Quality", y = "Sale Price",
       color = "Overall Condition") +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),  
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12)  
  )
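
The positive association described above can be quantified with a rank correlation, which suits an ordinal rating such as OverallQual. The sketch below is illustrative only: `toy_sales` and its values are hypothetical and are not drawn from the dataset.

```r
# Hypothetical toy data for illustration; not taken from the Ames dataset
toy_sales <- data.frame(
  OverallQual = c(3, 4, 5, 5, 6, 7, 8, 9),
  SalePrice   = c(80000, 95000, 120000, 135000, 160000, 210000, 280000, 390000)
)

# Spearman's rho measures monotonic association and handles tied ratings
# by averaging their ranks
rank_correlation <- cor(toy_sales$OverallQual, toy_sales$SalePrice,
                        method = "spearman")
print(rank_correlation)
```

On the full data the same call would take `ames_housing$OverallQual` and `ames_housing$SalePrice`, with `use = "complete.obs"` added if either column has missing values.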

quality_condition_stats <- ames_housing %>%
  group_by(OverallQual, OverallCond) %>%
  summarize(
    Count = n(),
    Average = mean(SalePrice, na.rm = TRUE),
    Median = median(SalePrice, na.rm = TRUE),
    Variance = var(SalePrice, na.rm = TRUE),
    SD = sd(SalePrice, na.rm = TRUE),
    .groups = 'drop'
  )

print(quality_condition_stats)
## # A tibble: 52 × 7
##    OverallQual OverallCond Count Average  Median   Variance     SD
##          <int>       <int> <int>   <dbl>   <dbl>      <dbl>  <dbl>
##  1           1           1     1  61000   61000         NA     NA 
##  2           1           3     1  39300   39300         NA     NA 
##  3           2           3     2  47656.  47656. 304773360. 17458.
##  4           2           5     1  60000   60000         NA     NA 
##  5           3           2     2  80750   80750   36125000   6010.
##  6           3           3     3  69167.  67000  141583333. 11899.
##  7           3           4     6  91817.  91950   62821667.  7926.
##  8           3           5     2 117300  117300  994580000  31537.
##  9           3           6     5  69760   72500  710033000  26646.
## 10           3           7     1 120000  120000         NA     NA 
## # ℹ 42 more rows
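
Rows with a Count of 1 in the table above have no variance, and they are the same quality and condition combinations that the violin layer drops with the "fewer than two datapoints" warning. A minimal sketch of listing such groups follows. It is written in base R so it runs without extra packages, and `toy_houses` is hypothetical data.

```r
# Hypothetical toy data: one row per house
toy_houses <- data.frame(
  OverallQual = c(1, 1, 2, 2, 2, 3),
  OverallCond = c(1, 3, 3, 3, 5, 2)
)

# Count houses in each quality/condition combination
group_sizes <- aggregate(
  list(Count = rep(1L, nrow(toy_houses))),
  by = toy_houses[c("OverallQual", "OverallCond")],
  FUN = sum
)

# Combinations with fewer than two houses cannot form a violin
singleton_groups <- group_sizes[group_sizes$Count < 2, ]
print(singleton_groups)
```

With the summary already computed in this report, the equivalent check would be `quality_condition_stats %>% filter(Count < 2)`.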

  • Overall Quality and Overall Condition of Houses in Neighborhood

    This bubble plot captures the relationship between average quality, sales volume, and sale prices across neighborhoods in the Ames housing dataset. Bubble size encodes the number of houses sold, and color encodes the average sale price, from blue (low) to red (high). High-quality, high-priced neighborhoods such as “NridgHt,” “NoRidge,” and “StoneBr” appear as redder bubbles high on the quality axis, while neighborhoods such as “OldTown” and “Edwards” exhibit lower prices and quality but higher sales volume, indicated by larger, bluer bubbles. This visualization aids in understanding neighborhood characteristics and can inform decisions for buyers, developers, and urban planners.

neighborhood_quality_stats <- ames_housing %>%
  group_by(Neighborhood) %>%
  summarize(
    AvgQuality = mean(OverallQual, na.rm = TRUE),
    HouseCount = n(),
    AvgSalePrice = mean(SalePrice, na.rm = TRUE),
    .groups = 'drop'  
  ) %>%
  arrange(desc(AvgQuality))  

ggplot(neighborhood_quality_stats, aes(x = Neighborhood, y = AvgQuality, size = HouseCount, color = AvgSalePrice)) +
  geom_point(alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +  
  scale_size(range = c(3, 12), name = "House Count") +  
  labs(title = "Neighborhood Quality, Volume, and Value",
       x = "Neighborhood",
       y = "Average Quality",
       color = "Average Sale Price",
       size = "House Count") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text.x = element_text(angle = 90, hjust = 1),
    legend.position = "right"
  )
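
Note that `arrange(desc(AvgQuality))` sorts the rows of the summary, but ggplot2 places a character x variable in alphabetical order, so the sort does not carry over to the plot. One possible way to make the axis follow average quality, shown here as a sketch on hypothetical data (`toy_stats` and the names A, B, C are placeholders), is to convert the column to a factor whose levels are ordered by quality.

```r
# Hypothetical toy summary; A, B, C stand in for neighborhood names
toy_stats <- data.frame(
  Neighborhood = c("A", "B", "C"),
  AvgQuality   = c(5.2, 7.9, 6.4),
  stringsAsFactors = FALSE
)

# Order the factor levels by descending average quality so that a
# discrete axis follows this order instead of the alphabet
quality_order <- toy_stats$Neighborhood[order(-toy_stats$AvgQuality)]
toy_stats$Neighborhood <- factor(toy_stats$Neighborhood, levels = quality_order)

levels(toy_stats$Neighborhood)
```

The same two lines could be applied to `neighborhood_quality_stats` before the `ggplot()` call.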

  1. What is the relationship between having a garage and the Sale Price of Houses in Ames?

    The visual analyses provide insights into how garage characteristics influence house sale prices in Ames. The boxplot highlights higher median prices for homes with attached or built-in garages, especially those accommodating three cars, indicating buyer preference. Conversely, homes without garages or with carports fetch lower prices. The violin and scatter plots show that prices vary with both garage capacity and type, with three-car attached and built-in garages at the top of the range. Overall, these visuals underscore the significant impact of garage features on property values.

  • Impact of Garage on Sale Price of House

    The boxplot succinctly demonstrates how garage type and capacity influence house sale prices in Ames. It reveals that homes with three-car garages, especially attached or built-in, command higher prices, while those with four-car garages surprisingly show lower median sale prices. Conversely, properties lacking a garage or with just a carport fetch lower prices, underscoring the importance of garage space in real estate valuation. This visualization effectively captures the nuanced relationship between garage characteristics and property values.

ggplot(ames_housing, aes(x = as.factor(GarageCars), y = SalePrice, fill = GarageType)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +  
  scale_fill_brewer(palette = "Set3") + 
  labs(title = "Impact of Garage Cars and Type on Sale Price",
       x = "Number of Cars in Garage", y = "Sale Price", fill = "Garage Type") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),  
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12),  
    legend.position = "bottom"  
  )

  • Violin and Scatter Plot of Sale Price by GarageType

    The violin and scatter plots illustrate the influence of garage capacity and type on house sale prices. Homes without garages fetch the lowest prices, indicating that the market discounts the absence of this feature. As garage capacity increases from one to three cars, prices generally rise: one-car garages show moderate prices, two-car garages are the most common, and three-car garages command the highest prices when attached or built-in. Four-car garages are rare (five homes in the summary table below) and sell for less than three-car attached or built-in garages, consistent with the boxplot. Overall, the plots capture the relationship between garage characteristics and property values, reflecting buyer preferences and utility considerations.

ggplot(ames_housing, aes(x = as.factor(GarageCars), y = SalePrice, fill = GarageType)) +
  geom_violin(trim = FALSE, alpha = 0.7) +  
  geom_jitter(width = 0.1, alpha = 0.5, color = "black", size = 2) +  
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Violin and Scatter Plot of Sale Prices by Garage Cars and Type",
       x = "Number of Cars in Garage", y = "Sale Price",
       fill = "Garage Type") + 
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),  
    axis.title = element_text(size = 12),  
    axis.text = element_text(size = 12)  
  )
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

garage_stats <- ames_housing %>%
  group_by(GarageType, GarageCars) %>%
  summarize(
    Count = n(),
    Average = mean(SalePrice, na.rm = TRUE),
    Median = median(SalePrice, na.rm = TRUE),
    Variance = var(SalePrice, na.rm = TRUE),
    SD = sd(SalePrice, na.rm = TRUE),
    .groups = 'drop'
  )

print(garage_stats)
## # A tibble: 19 × 7
##    GarageType GarageCars Count Average  Median     Variance      SD
##    <chr>           <int> <int>   <dbl>   <dbl>        <dbl>   <dbl>
##  1 2Types              2     1 150000  150000           NA      NA 
##  2 2Types              3     4 147425  158000   1918455833.  43800.
##  3 2Types              4     1 168000  168000           NA      NA 
##  4 Attchd              1   171 136278. 135000    843755687.  29047.
##  5 Attchd              2   560 195987. 187000   2275119112.  47698.
##  6 Attchd              3   138 313434. 295246.  9378842708.  96844.
##  7 Attchd              4     1 206300  206300           NA      NA 
##  8 Basment             1     8 135156. 135750    943945312.  30724.
##  9 Basment             2    11 179054. 164000   5811995008.  76236.
## 10 BuiltIn             1     8 124188. 125000    626566964.  25031.
## 11 BuiltIn             2    50 215000. 214450   1819316461.  42653.
## 12 BuiltIn             3    30 355821. 339084  10133764941. 100667.
## 13 CarPort             1     3 118300  108000   1797670000   42399.
## 14 CarPort             2     6 105793. 105380.   189627800.  13771.
## 15 Detchd              1   179 120346. 119200    895131422.  29919.
## 16 Detchd              2   196 144064. 138500   1505180669.  38797.
## 17 Detchd              3     9 169544. 124000  15248667778. 123485.
## 18 Detchd              4     3 196326. 200000   5120870480.  71560.
## 19 None                0    81 103317. 100000   1076825760.  32815.
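
The 19-row summary above can be collapsed by garage capacity to compare capacities directly. The sketch below re-enters the printed Count and Average columns by hand, so the averages carry the table's rounding, and computes a count-weighted mean per capacity. The names `garage_summary` and `by_capacity` are illustrative and are not objects from the original analysis.

```r
# Count and Average columns copied from the printed garage_stats table
garage_summary <- data.frame(
  GarageType = c("2Types", "2Types", "2Types",
                 "Attchd", "Attchd", "Attchd", "Attchd",
                 "Basment", "Basment",
                 "BuiltIn", "BuiltIn", "BuiltIn",
                 "CarPort", "CarPort",
                 "Detchd", "Detchd", "Detchd", "Detchd",
                 "None"),
  GarageCars = c(2, 3, 4, 1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 4, 0),
  Count      = c(1, 4, 1, 171, 560, 138, 1, 8, 11, 8, 50, 30, 3, 6,
                 179, 196, 9, 3, 81),
  Average    = c(150000, 147425, 168000, 136278, 195987, 313434, 206300,
                 135156, 179054, 124188, 215000, 355821, 118300, 105793,
                 120346, 144064, 169544, 196326, 103317)
)

# Combine garage types within each capacity, weighting each group's
# average price by its number of homes
by_capacity <- do.call(rbind, lapply(
  split(garage_summary, garage_summary$GarageCars),
  function(g) {
    data.frame(
      GarageCars      = g$GarageCars[1],
      Homes           = sum(g$Count),
      WeightedAverage = weighted.mean(g$Average, g$Count)
    )
  }
))
print(by_capacity)
```

Because this averages group means rather than raw sales, it is only a rough check; grouping `ames_housing` by `GarageCars` alone would give exact figures.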

  • Garage Area and Neighborhood

    The bar graph presents garage availability by neighborhood and type in the Ames dataset. Neighborhoods like “NAmes” and “OldTown” feature diverse garage options, with attached, built-in, and detached types, catering to various buyer preferences. Predominantly, attached garages emerge as the most common type, indicating a favored design choice. Conversely, “Blmngtn” and “BrDale” show limited garage availability, with minimal attached garages and no other types, possibly due to newer urban planning or space constraints. This visualization aids real estate professionals and homebuyers in assessing garage options tailored to individual needs.

garage_by_neighborhood <- ames_housing %>%
  group_by(Neighborhood, GarageType) %>%
  summarize(
    # n() counts homes per group; rows with GarageType "None" therefore
    # count homes without a garage
    TotalGarages = n(),
    AverageCars = mean(GarageCars, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(TotalGarages))

print(garage_by_neighborhood)
## # A tibble: 93 × 4
##    Neighborhood GarageType TotalGarages AverageCars
##    <chr>        <chr>             <int>       <dbl>
##  1 NAmes        Attchd              148        1.47
##  2 CollgCr      Attchd              122        2.07
##  3 OldTown      Detchd               81        1.63
##  4 NWAmes       Attchd               69        2.01
##  5 NAmes        Detchd               60        1.8 
##  6 Somerst      Attchd               60        2.37
##  7 NridgHt      Attchd               59        2.63
##  8 Gilbert      Attchd               54        2.07
##  9 SawyerW      Attchd               45        2.02
## 10 BrkSide      Detchd               44        1.41
## # ℹ 83 more rows

ggplot(garage_by_neighborhood, aes(x = Neighborhood, y = TotalGarages, fill = GarageType)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Garage Availability by Neighborhood and Type",
       x = "Neighborhood",
       y = "Total Garages",
       fill = "Garage Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_fill_brewer(palette = "Set1")

Further Pre-processing and Feature Engineering

Sale Price of House by its age of remodeling

The graph illustrates the sale prices of houses segmented by the age since their last remodel and their overall quality rating. The x-axis categorizes homes based on the time elapsed since they were last remodeled, ranging from newly remodeled to those remodeled over 20 years ago. The y-axis represents the sale price, and the data points are color-coded according to a 10-point quality scale, where 1 represents the lowest quality and 10 the highest.

From the graph, it is evident that newly remodeled homes generally achieve higher sale prices, with a noticeable peak in price for those in the highest quality categories. The presence of high-price spikes in the newly remodeled category across multiple quality ratings underscores the value added by recent renovations. Interestingly, even homes remodeled 6-10 and 11-15 years ago in the highest quality ratings (9 and 10) exhibit some high sale price points, suggesting that exceptional quality can sustain higher property values even as the remodel ages.

In contrast, as the remodel age increases beyond 15 years, the maximum sale prices tend to decrease, particularly evident in the “16-20 years” and “Over 20 years” categories. However, homes in these older remodel categories that maintain a high quality rating (8-10) still occasionally reach higher sale prices, indicating that quality remains a significant determinant of price irrespective of the age of the remodel.

Moreover, the graph highlights significant price variability within each remodeling age category, especially among homes with mid-range quality ratings (4-7). This variability suggests that factors beyond the age of remodel and inherent quality—possibly including location, size, or specific home features—are influencing sale prices.

Overall, this visualization effectively demonstrates how recent remodeling and high-quality ratings can drive up home sale prices, while also revealing the sustained value of well-maintained properties even as they age. This insight is crucial for both sellers considering the value of undertaking renovations and buyers evaluating the long-term value of their investments.

ames_housing$AgeSinceRemodel <- ifelse(
  is.na(ames_housing$YearRemodAdd),
  ames_housing$YrSold - ames_housing$YearBuilt,  
  ames_housing$YrSold - ames_housing$YearRemodAdd  
)

ames_housing$AgeCategory <- cut(
  ames_housing$AgeSinceRemodel,
  breaks = c(-Inf, 0, 5, 10, 15, 20, Inf),  
  labels = c("Newly remodeled", "1-5 years", "6-10 years", "11-15 years", "16-20 years", "Over 20 years"),
  include.lowest = TRUE  
)
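
Because `cut()` closes each interval on the right by default, the bins are (-Inf, 0], (0, 5], (5, 10], and so on. The short check below, using a few hand-picked ages (`example_ages` is an illustrative vector, not data from the report), shows where boundary values land. Any negative age, which would arise if a remodel year were later than the sale year, also falls into “Newly remodeled”.

```r
age_breaks <- c(-Inf, 0, 5, 10, 15, 20, Inf)
age_labels <- c("Newly remodeled", "1-5 years", "6-10 years",
                "11-15 years", "16-20 years", "Over 20 years")

# Hand-picked ages around the bin boundaries (illustrative values)
example_ages <- c(-1, 0, 1, 5, 6, 20, 21)

# Default right = TRUE: each bin includes its upper bound
example_categories <- cut(example_ages, breaks = age_breaks, labels = age_labels)

print(data.frame(Age = example_ages, Category = example_categories))
```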

ggplot(ames_housing, aes(x = AgeCategory, y = SalePrice, fill = AgeCategory, color = as.factor(OverallQual))) +
  geom_violin(trim = FALSE) +
  labs(title = "Sale Price of Houses by Age Since Remodel and Overall Quality",
       x = "Age Since Remodel Category",
       y = "Sale Price",
       color = "Overall Quality") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right") 
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

Sale Price of House in Top 5 Neighborhoods by the age of Remodeling

The graph shows the impact of remodeling age and overall quality on the sale prices of houses within the five neighborhoods with the most sales in Ames, Iowa: CollgCr, Edwards, NAmes, OldTown, and Somerst. Each subplot represents one of these neighborhoods and plots sale price against the age since the house was last remodeled, categorized into six distinct groups ranging from newly remodeled to those remodeled over 20 years ago, with house quality ratings from 4 to 10 as additional variables.

A clear pattern emerges across all neighborhoods, showing that houses that have been recently remodeled typically command higher prices. This trend is particularly pronounced in the Somerst neighborhood, where a wide price dispersion among newly remodeled homes suggests significant differences in house size, features, or possibly the extent of the renovations undertaken. High-quality ratings (9 and 10) consistently correlate with higher sale prices across different remodeling age categories, indicating a strong market preference for superior quality homes.

Each neighborhood displays unique pricing characteristics, likely influenced by local market conditions and demographic factors. For example, OldTown generally shows lower sale prices across all categories compared to the more upscale neighborhoods like CollgCr and Somerst. This might reflect differences in neighborhood desirability, local amenities, or the historical value of the properties.

Additionally, a general decline in prices is observed as the age since last remodel increases. This is evident in neighborhoods like Edwards and NAmes, where older remodels are associated with lower house prices, underscoring the market’s preference for recent updates. This decline also points to the depreciation of home features and the potential need for newer updates to attract buyers.

Overall, the graph effectively encapsulates how recent renovations, coupled with high quality, enhance home values, while also illustrating significant variances in how these factors play out across different neighborhoods, thus reflecting the complex dynamics of the local real estate market.

# Names of the five neighborhoods with the most sales (ties are kept)
top_neighborhoods <- ames_housing %>%
  count(Neighborhood, name = "Count") %>%
  slice_max(Count, n = 5) %>%
  pull(Neighborhood)

filtered_data <- ames_housing %>%
  filter(Neighborhood %in% top_neighborhoods)

ggplot(filtered_data, aes(x = AgeCategory, y = SalePrice, fill = AgeCategory, color = as.factor(OverallQual))) +
  geom_violin(trim = FALSE) +
  facet_wrap(~Neighborhood, scales = "free_y") +  
  labs(title = "Sale Price of Houses by Age Since Remodel and Overall Quality in Top 5 Neighborhoods",
       x = "Age Since Remodel Category",
       y = "Sale Price",
       color = "Overall Quality") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text.x = element_text(size = 8, face = "bold"),
        legend.position = "right") 
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

Sale Price of House based on its built year and house style across different Neighborhoods

The graph offers a detailed analysis of the average sale prices of new houses across various neighborhoods in Ames, Iowa, differentiated by house type and overall quality. Each subplot corresponds to a different quality rating (from 4 to 10), showcasing how the average sale prices vary according to both the type of house and the neighborhood.

Starting with the quality rating of 4, which is depicted only for Edwards, the average sale price is markedly lower, hinting at a potentially less desirable location or less appealing house features in this particular category. As we progress to higher quality ratings (5 through 6), a broader range of neighborhoods and house types are represented, showing a general trend of increasing average sale prices with improvements in overall quality. Notably, the diversity in house types (such as 1.5 Finished, 1 Story, 2 Story, Split Foyer, and Split Level) suggests varied buyer preferences and lifestyle needs, which in turn affect sale prices.

By the time we reach quality ratings of 7 and 8, the graph exhibits a more competitive price range across neighborhoods like Blmngtn, CollgCr, and Somerst. This middle range of quality indicates robust demand and a potentially balanced offering in terms of home features and neighborhood desirability. Interestingly, the price variance within these quality ratings is less pronounced between different house types, implying that quality may weigh more heavily than house type in buyer decisions at this level.

The subplots for higher qualities (9 and 10) show a pronounced increase in sale prices, with neighborhoods like NridgHt, StoneBr, and Timber featuring prominently. These areas likely offer superior amenities or advantageous locations, which, when combined with high-quality homes, command premium prices. Notably, at the highest quality rating of 10, only a few neighborhoods are represented, highlighting exclusivity and possibly limited availability of top-tier homes.

Overall, this visualization clearly demonstrates the interplay between house type, neighborhood, and overall quality in determining the sale prices of new homes in Ames. The data suggests that while quality consistently drives prices up, neighborhood selection and house type also play critical roles in shaping market values, catering to a range of preferences and financial capabilities among potential buyers. This detailed breakdown serves as a valuable tool for understanding how various factors contribute to housing market dynamics in the region.

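# Flag houses built within five years of the sale (1 = new, 0 = not new).
# The comparison yields NA when YearBuilt or YrSold is missing, so
# missing values propagate to IsNew without nested ifelse() calls.
ames_housing$IsNew <- as.integer(
  ames_housing$YearBuilt >= ames_housing$YrSold - 5
)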

new_houses_prices_type <- ames_housing %>%
  filter(IsNew == 1) %>%
  group_by(Neighborhood, HouseStyle, OverallQual) %>%
  summarise(AverageSalePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
  arrange(Neighborhood, desc(AverageSalePrice))

neighborhoods_with_new_houses <- new_houses_prices_type %>%
  filter(AverageSalePrice > 0) %>%
  pull(Neighborhood)

ggplot(new_houses_prices_type, aes(x = Neighborhood, y = AverageSalePrice, fill = HouseStyle)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~OverallQual, scales = "free", labeller = label_both) +  
  labs(title = "Average Sale Price of New Houses by Neighborhood, House Type, and Overall Quality",
       x = "Neighborhood",
       y = "Average Sale Price") +
  scale_fill_brewer(palette = "Set2", name = "House Type") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text.x = element_text(size = 8, face = "bold")) +
  guides(fill = guide_legend(title = "House Type"))

Sale Price of House by Renovation Status across Neighborhoods

The graph relates renovation status, overall quality, and average sale prices of houses across a variety of neighborhoods in Ames, offering insights into real estate market dynamics. It shows that renovated houses generally command higher prices than their non-renovated counterparts across quality levels and neighborhoods. This trend is consistent with the general market perception that renovations enhance value, supporting a higher resale price.

Notably, the graph breaks down these dynamics across a spectrum of quality ratings from 1 to 10, revealing that the impact of renovations is especially significant in higher-quality homes. For example, in quality ratings 9 and 10, renovated properties in affluent neighborhoods like StoneBr and NridgHt achieve sale prices that are markedly higher than those of non-renovated properties, sometimes by hundreds of thousands of dollars. This suggests a strong buyer preference for turnkey properties in premium locations, where the perceived value added through high-end renovations is greatest.

Conversely, the graph indicates a plateau in the renovation impact within lower-quality segments (ratings 1 to 4). In these categories, even substantial renovations yield only modest increases in sale prices, particularly in less desirable neighborhoods such as Edwards and BrkSide. This could reflect a limitation in the market’s willingness to pay premium prices for properties in areas with lower overall appeal, regardless of the improvements made.

Further, the graph illustrates variability in the effect of renovations across different neighborhoods. For instance, while renovated homes in middle-tier neighborhoods like CollgCr and Gilbert see significant price boosts, the same renovations in neighborhoods like IDOTRR and SWISU result in comparatively smaller price differences. This highlights the importance of location as a determinant of renovation ROI, indicating that the same investment in different areas can yield vastly different returns based on local market conditions and buyer preferences.

Overall, the detailed analysis provided by the graph offers crucial insights for homeowners and real estate investors. It suggests that while renovations generally increase property values, the scale of this increase is heavily influenced by the property’s baseline quality and its neighborhood context. Thus, strategic consideration of where and how to invest in renovations can significantly affect the financial outcome of such endeavors in the real estate market.

# Flag a house as renovated when its remodel year is later than its build year.
# The comparison returns NA when either year is missing, so NAs are preserved.
ames_housing$WasRenovated <- as.integer(
    ames_housing$YearRemodAdd > ames_housing$YearBuilt
)

sale_prices_by_reno_status <- ames_housing %>%
  group_by(Neighborhood, WasRenovated, OverallQual) %>%
  summarise(AverageSalePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
  arrange(Neighborhood, desc(AverageSalePrice))

ggplot(sale_prices_by_reno_status, aes(x = Neighborhood, y = AverageSalePrice, fill = as.factor(WasRenovated))) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~OverallQual, scales = "free", labeller = label_both) +  
  scale_fill_manual(values = c("0" = "red", "1" = "green"), labels = c("0" = "Not Renovated", "1" = "Renovated")) +
  labs(title = "Average Sale Price of Houses by Renovation Status, Overall Quality, and Neighborhood",
       x = "Neighborhood",
       y = "Average Sale Price",
       fill = "Renovation Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text.x = element_text(size = 8, face = "bold")) +
  guides(fill = guide_legend(title = "Renovation Status"))
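
The renovation flag used above marks a house as renovated when its remodel year is later than its construction year. The sketch below illustrates that rule on a small toy data frame; the values are hypothetical and are not taken from the Ames files.

```r
# Toy data (hypothetical values, not taken from the Ames files).
toy <- data.frame(
  YearBuilt    = c(1960, 1995, 2004, NA),
  YearRemodAdd = c(1998, 1995, 2005, 2001)
)

# A comparison involving NA returns NA, so missing years propagate on their own.
toy$WasRenovated <- as.integer(toy$YearRemodAdd > toy$YearBuilt)

print(toy)
```

A house whose remodel year equals its build year is treated as not renovated, and a house with a missing year gets a missing flag rather than a 0.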

Sale Price of House based on House Size in Neighborhood

The graph shows the average sale prices of houses in the ten Ames neighborhoods with the highest average sale price, broken down by house size (medium and large) and overall quality rating (4 to 10). Consistently across neighborhoods, higher quality ratings are associated with higher average sale prices, underscoring the impact of overall material and finish quality on market value. Large houses also generally command higher prices than their medium-sized counterparts, particularly at higher quality ratings (7 through 10), which suggests a market preference for more spacious homes in combination with higher quality.

Looking at specific neighborhoods, premium areas such as NridgHt, StoneBr, and Somerst stand out in the highest quality segment (OverallQual: 10). Here, large homes reach the peak of the market in terms of sale prices, indicating that these areas are highly sought after, likely due to their location, community amenities, or other desirable attributes that complement the high quality and larger size of the homes. This contrast in price points across neighborhoods, especially at the highest quality level, highlights the interplay between neighborhood desirability, house size, and quality, where each factor amplifies the others.

For instance, while neighborhoods like CollgCr and ClearCr also feature in multiple quality brackets, the premium attached to large, high-quality homes is most pronounced in the most affluent areas, suggesting a tiered market where top-tier buyers have distinct preferences that sharply drive up prices. On the other hand, at lower quality ratings (4 to 6), while there remains a noticeable difference in prices between house sizes, the gap is relatively smaller and less influenced by neighborhood, indicating a more uniform valuation approach that focuses more on basic house attributes rather than premium features or specific neighborhood allure.

This detailed analysis illuminates how real estate values in Ames are shaped by a complex array of factors including the intrinsic attributes of the homes (size and quality) and the extrinsic appeal of their neighborhoods. For investors and homebuyers, understanding these dynamics can guide more informed decisions, pinpointing where the best value or potential for appreciation might lie based on the synergistic effects of quality, size, and location in the local housing market.

# Total square footage: first floor + second floor + basement.
# The sum is NA whenever any component is missing.
ames_housing$TotalSF <- ames_housing$X1stFlrSF +
    ames_housing$X2ndFlrSF +
    ames_housing$TotalBsmtSF

small_threshold <- 1000  
medium_threshold <- 2500  
ames_housing$HouseAreaCategory <- cut(ames_housing$TotalSF, 
                                      breaks = c(0, small_threshold, medium_threshold, Inf), 
                                      labels = c("Small", "Medium", "Large"),
                                      include.lowest = TRUE)

overall_neighborhood_avg_price <- ames_housing %>%
  group_by(Neighborhood) %>%
  summarise(AverageSalePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
  arrange(desc(AverageSalePrice)) %>%
  slice_head(n = 10)

top_neighborhoods_data <- ames_housing %>%
  filter(Neighborhood %in% overall_neighborhood_avg_price$Neighborhood)

neighborhood_size_avg_price <- top_neighborhoods_data %>%
  group_by(Neighborhood, HouseAreaCategory, OverallQual) %>%
  summarise(AverageSalePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
  arrange(Neighborhood, desc(AverageSalePrice))

p <- ggplot(neighborhood_size_avg_price, aes(x = reorder(Neighborhood, -AverageSalePrice), y = AverageSalePrice, fill = HouseAreaCategory)) +
  geom_bar(stat = "identity", position = "dodge") +  # side by side: stacked averages would add up to a meaningless total
  facet_wrap(~OverallQual, scales = "free", labeller = label_both) +  
  labs(title = "Average Sale Price by House Size and Overall Quality in Top 10 Neighborhoods",
       x = "Neighborhood",
       y = "Average Sale Price",
       fill = "House Size") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text.x = element_text(size = 8, face = "bold"),
        legend.position = "right")

p_plotly <- ggplotly(p) %>% 
  layout(title = "Average Sale Price by House Size and Overall Quality in Top 10 Neighborhoods",
         xaxis = list(title = "Neighborhood"),
         yaxis = list(title = "Average Sale Price"),
         legend = list(title = list(text = "House Size")),
         hovermode = "closest")

p_plotly
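
The size categories above depend on how `cut()` treats values that fall exactly on a threshold. The sketch below applies the same breaks and labels to a few hypothetical areas chosen to sit on and around the two thresholds.

```r
small_threshold  <- 1000
medium_threshold <- 2500

# Hypothetical total areas chosen to sit on and around the two thresholds.
total_sf <- c(0, 1000, 1001, 2500, 2501)

# Intervals are closed on the right by default; include.lowest = TRUE
# also keeps an area of exactly 0 in the first bin.
size <- cut(total_sf,
            breaks = c(0, small_threshold, medium_threshold, Inf),
            labels = c("Small", "Medium", "Large"),
            include.lowest = TRUE)

data.frame(total_sf, size)
```

With these settings a house of exactly 1,000 square feet is "Small" and a house of exactly 2,500 square feet is "Medium".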

Model Development

Model 2, which regresses Sale Price on Overall Quality, is the most effective of the four models evaluated for predicting house prices. The association between overall quality and sale price is very strong, as shown by the large t-statistic (49.36366) and the extremely small p-value (about 2.19e-313). The coefficient of $45,435.80 means that each one-point increase in the quality rating is associated with an average increase of that amount in sale price, which aligns with typical market expectations.

In addition, Model 2’s Root Mean Square Error (RMSE) of 48,589.45, while substantial, is the lowest among the models tested, so its fitted values sit closer to the observed sale prices than those of the other three models. This comparative precision, together with the model’s agreement with real estate market dynamics, where quality is a crucial determinant of property value, makes it useful for both theoretical analysis and practical applications in the real estate sector. Model 2 is therefore the most reliable of the four single-predictor models for understanding and predicting how property quality affects sale prices.
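
The RMSE comparison can be reproduced from the four values reported in the individual model sections of this report. The sketch below is a minimal illustration that sorts them; the vector names are chosen here for readability only.

```r
# RMSE values as reported in the four model sections of this report.
rmse <- c(
  "Overall Condition" = 79174.24,
  "Overall Quality"   = 48589.45,
  "Garage Area"       = 62093.07,
  "Living Area"       = 56034.30
)

# Sort from smallest (closest in-sample fit) to largest.
ranked <- sort(rmse)
print(ranked)

best <- names(ranked)[1]
```

Because all four models use a single predictor and the same response, their RMSE values are directly comparable.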

Model 1: Sale Price vs OverallCond

The graph “Sale Price vs. Overall Condition” visualizes the relationship between the overall condition of houses and their sale prices, highlighting an unexpected trend. Unlike what one might anticipate, the regression line, which is nearly flat with a slight negative slope, suggests that higher overall condition ratings do not correspond to higher sale prices. This is counterintuitive as better condition is typically expected to enhance a home’s value.

The regression analysis provides further details on this relationship. The intercept is approximately $211,909.59, suggesting the base sale price for houses with an overall condition score at zero, a hypothetical scenario for positioning the regression model. More crucially, the coefficient for Overall Condition is -$5,558.12, indicating that, on average, each unit increase in the overall condition rating is associated with a decrease in sale price by this amount. This negative relationship is statistically significant with a p-value of 0.0029, indicating that it is unlikely to have occurred by chance.

Additionally, the RMSE (Root Mean Square Error) value of $79,174.24 reflects a high degree of variability in sale prices that the model based on overall condition alone does not capture. This suggests other factors might play a significant role in determining the sale prices of houses, overshadowing the impact of their overall condition.

This analysis potentially points to market dynamics where buyers might not value incremental improvements in condition as highly as expected, or where other attributes of a property—such as location, size, or modernity—may be driving prices more significantly. The relatively high RMSE also suggests a model incorporating more variables might better explain the variance in sale prices.

In summary, while the model shows a statistically significant negative impact of overall condition on sale prices, the practical interpretation and the high RMSE underscore the complexity of real estate valuation, where multiple factors interact in determining a property’s market value. This serves as an important consideration for sellers and buyers in the real estate market, suggesting that enhancements in condition alone might not always correspond to expected increases in property value.

p <- ggplot(ames_housing, aes(OverallCond, SalePrice)) +
  geom_point() +
  geom_smooth(method='lm', se=FALSE) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  labs(title="Sale Price vs. Overall Condition",
       x="Overall Condition",
       y="Sale Price ($)")

m1_cond <- lm(SalePrice ~ OverallCond, data = ames_housing)

predictions <- predict(m1_cond, ames_housing)
residuals <- ames_housing$SalePrice - predictions
rmse <- sqrt(mean(residuals^2))

p + labs(subtitle = paste("RMSE:", round(rmse, 2)))
## `geom_smooth()` using formula = 'y ~ x'

tidy(m1_cond)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  211910.    10597.     20.0  8.27e-79
## 2 OverallCond   -5558.     1864.     -2.98 2.91e- 3
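
The statistic and p-value in the output above are linked: the statistic is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution. The sketch below recomputes both from the rounded table values. The residual degrees of freedom are an assumption based on the 1,460 training rows mentioned in the abstract.

```r
# Values copied (rounded) from the tidy(m1_cond) output above.
estimate  <- -5558
std_error <- 1864

# Assumption: 1,460 training rows and 2 fitted coefficients.
df_resid <- 1460 - 2

t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

round(t_stat, 2)
signif(p_value, 2)
```

Small differences from the printed p-value are expected because the inputs here are rounded.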

Model 2: Sale Price vs OverallQual

The graph titled “Sale Price vs. Overall Quality” displays a strong positive correlation between the overall quality of houses and their sale prices, with an added statistical annotation of the Root Mean Square Error (RMSE) of 48,589.45. This RMSE value quantifies the average magnitude of the errors between the predicted sale prices by the model and the actual sale prices, suggesting that the model has a moderate degree of prediction error.

The regression analysis, detailed in the summary statistics provided, strongly supports the visual trend observed in the graph. The intercept of the regression line is approximately -$96,206.08. In theory this is the expected sale price of a house with an overall quality score of zero; in practice the rating scale starts at 1, so the intercept simply positions the regression line within the actual data range. The coefficient for Overall Quality is $45,435.80, indicating that each one-point increase in the overall quality rating is associated with an average increase in sale price of approximately $45,436. This coefficient is highly statistically significant, with an extremely small p-value (2.19e-313), so an association this strong is very unlikely to arise by chance.

The large value of the statistic (49.36366) further confirms the robustness of this relationship, implying a very strong influence of overall quality on the sale price. This statistical strength, combined with the practical interpretation of the slope, underscores the critical role that quality plays in the housing market, where higher quality not only commands higher prices but does so in a predictably substantial manner.

While the model demonstrates a significant and strong relationship between quality and price, the RMSE of 48,589.45 also indicates that the model doesn’t capture all variability in the sale prices. This spread, visible as the vertical dispersion of points around the regression line, especially at higher quality ratings, suggests other influencing factors such as location, size, or specific amenities, which might also impact the sale prices but are not accounted for in this single-variable model.

In summary, the analysis clearly demonstrates that improving the overall quality of a house is likely to result in a significant increase in its sale price, although with a quantifiable uncertainty as indicated by the RMSE. This insight is crucial for both buyers, who may be willing to pay a premium for higher quality, and sellers or developers, who might consider quality enhancements as a profitable investment in the property market.

p <- ggplot(ames_housing, aes(OverallQual, SalePrice)) +
  geom_point() +
  geom_smooth(method='lm', se=FALSE) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  labs(title="Sale Price vs. Overall Quality",
       x="Overall Quality",
       y="Sale Price ($)")

m2_qual <- lm(SalePrice ~ OverallQual, data = ames_housing)

predictions <- predict(m2_qual, ames_housing)
residuals <- ames_housing$SalePrice - predictions
rmse <- sqrt(mean(residuals^2))

p + labs(subtitle = paste("RMSE:", round(rmse, 2)))
## `geom_smooth()` using formula = 'y ~ x'

tidy(m2_qual)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -96206.     5756.     -16.7 1.67e- 57
## 2 OverallQual   45436.      920.      49.4 2.19e-313
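
To see what the negative intercept means in practice, the fitted line can be evaluated over the 1 to 10 quality scale. The sketch below uses the coefficients reported for this model and is only an illustration of the arithmetic.

```r
# Coefficients as reported for the Sale Price vs Overall Quality model.
intercept <- -96206.080
slope     <- 45435.803

quality   <- 1:10
predicted <- intercept + slope * quality

data.frame(OverallQual = quality, PredictedPrice = round(predicted))
```

The straight line gives negative predicted prices at ratings 1 and 2, which is one sign that a single linear term does not describe the lowest quality ratings well.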

Model 3: Sale Price vs Garage Area

The graph titled “Sale Price vs. Garage Area” illustrates the relationship between the garage area of houses (measured in square feet) and their sale prices. A regression line, depicted in blue, indicates a positive correlation, suggesting that larger garage areas are generally associated with higher house sale prices. This correlation is further quantified in the regression analysis results provided.

The intercept of the regression model is approximately $71,357.42, suggesting that the base price of a house, absent consideration of garage area (i.e., when the garage area is zero), would be estimated at this value. The slope coefficient for the garage area is $231.65. This indicates that for each additional square foot of garage area, the sale price of the house is expected to increase by about $231.65. This relationship is statistically significant, with a p-value effectively at zero (5.265038e-158), reinforcing the strong influence of garage area on house pricing. The statistic of 30.44587 supports the robustness of this relationship.

The RMSE (Root Mean Square Error) of $62,093.07, however, highlights substantial variability in the sale prices that is not captured by the garage area alone. This suggests that while the garage area significantly impacts the sale price, other factors such as location, overall house size, amenities, and property condition also play crucial roles in determining the final sale price. The variability is visually represented by the scatter of data points around the regression line, indicating that while there’s a general trend of increasing prices with larger garages, the spread of prices at each level of garage area is considerable.

In summary, the analysis underscores the importance of garage area in home valuation, which could be particularly relevant for buyers looking for properties with ample garage space or sellers considering renovations that include garage expansions. However, the high RMSE also calls for a cautious interpretation, suggesting that stakeholders should consider multiple property features alongside garage size when assessing house values.

p <- ggplot(ames_housing, aes(GarageArea, SalePrice)) +
  geom_point() +
  geom_smooth(method='lm', se=FALSE) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  labs(title="Sale Price vs. Garage Area",
       x="Garage Area (sq. ft.)",
       y="Sale Price ($)")

m3_garage <- lm(SalePrice ~ GarageArea, data = ames_housing)

predictions <- predict(m3_garage, ames_housing)
residuals <- ames_housing$SalePrice - predictions
rmse <- sqrt(mean(residuals^2))

p + labs(subtitle = paste("RMSE:", round(rmse, 2)))
## `geom_smooth()` using formula = 'y ~ x'

tidy(m3_garage)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   71357.   3949.        18.1 5.11e- 66
## 2 GarageArea      232.      7.61      30.4 5.27e-158

Model 4: Sale Price vs Living Area

The graph “Sale Price vs. Living Area” illustrates a positive correlation between the living area of houses (in square feet) and their sale prices, indicated by a rising blue regression line. The Root Mean Square Error (RMSE) is reported as 56,034.3, which signifies the average deviation of the observed sale prices from those predicted by the model, highlighting substantial variability in house prices that cannot be solely explained by living area.

The regression analysis presents a more detailed quantitative relationship: the intercept, calculated at approximately $18,569.03, represents the theoretical sale price for a house with no living area, serving as a baseline figure in the model. More significantly, the slope coefficient for the living area is about $107.13, indicating that for every additional square foot of living area, the sale price of a house increases by this amount on average. This relationship is strongly supported by statistical evidence, shown by the very small p-value (4.518034e-223), which strongly rejects the null hypothesis of no effect of living area on sale price. The statistic of 38.348207 further emphasizes the strength of this relationship.

Despite this clear positive trend, the high RMSE suggests that other factors play a critical role in determining sale prices, such as location, construction quality, age of the property, and market conditions, which are not captured by living area alone. The scatter of points around the regression line, particularly at higher living areas, indicates that while larger homes generally fetch higher prices, the extent of this price increase can vary widely depending on these additional factors.

In conclusion, the analysis robustly demonstrates the significant impact of living area on house pricing, affirming that larger homes typically command higher prices. However, the variability underscored by the RMSE and the scatter around the regression line also calls for considering a broader range of property attributes when evaluating or predicting house prices beyond just the living area. This insight is particularly valuable for stakeholders in the real estate market, including buyers, sellers, and developers, when assessing property values or making investment decisions.

p <- ggplot(ames_housing, aes(GrLivArea, SalePrice)) +
  geom_point() +
  geom_smooth(method='lm', se=FALSE) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  labs(title="Sale Price vs. Living Area",
       x="Living Area (sq. ft.)",
       y="Sale Price ($)")

m4_living <- lm(SalePrice ~ GrLivArea, data = ames_housing)

predictions <- predict(m4_living, ames_housing)
residuals <- ames_housing$SalePrice - predictions
rmse <- sqrt(mean(residuals^2))

p + labs(subtitle = paste("RMSE:", round(rmse, 2)))
## `geom_smooth()` using formula = 'y ~ x'

tidy(m4_living)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   18569.   4481.        4.14 3.61e-  5
## 2 GrLivArea       107.      2.79     38.3  4.52e-223

Predicting Sale Prices of Properties (Houses) described by Test Dataset

m1_cond <- lm(SalePrice ~ OverallCond, data = ames_housing)
m2_qual <- lm(SalePrice ~ OverallQual, data = ames_housing)
m3_garage <- lm(SalePrice ~ GarageArea, data = ames_housing)
m4_living <- lm(SalePrice ~ GrLivArea, data = ames_housing)

ameshous_test_data <- ameshous_test_data %>%
  mutate(
    Pred_SalePrice_Cond = predict(m1_cond, newdata = ameshous_test_data),
    Pred_SalePrice_Qual = predict(m2_qual, newdata = ameshous_test_data),
    Pred_SalePrice_Garage = predict(m3_garage, newdata = ameshous_test_data),
    Pred_SalePrice_Living = predict(m4_living, newdata = ameshous_test_data)
  )

ameshous_test_data %>%
  select(OverallCond, OverallQual, GarageArea, GrLivArea, 
         Pred_SalePrice_Cond, Pred_SalePrice_Qual, 
         Pred_SalePrice_Garage, Pred_SalePrice_Living) %>%
  head()
##   OverallCond OverallQual GarageArea GrLivArea Pred_SalePrice_Cond
## 1           6           5        730       896            178560.9
## 2           6           6        312      1329            178560.9
## 3           5           5        482      1629            184119.0
## 4           6           6        470      1604            178560.9
## 5           5           8        506      1280            184119.0
## 6           5           6        440      1655            184119.0
##   Pred_SalePrice_Qual Pred_SalePrice_Garage Pred_SalePrice_Living
## 1            130972.9              240458.7              114557.8
## 2            176408.7              143630.9              160945.3
## 3            130972.9              183010.6              193084.4
## 4            176408.7              180230.9              190406.1
## 5            267280.3              188570.1              155695.9
## 6            176408.7              173281.5              195869.8
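
Each predicted price above is the model intercept plus the slope times the predictor value. The sketch below recomputes the four predictions for the first test row from the coefficients reported in this document; the short model labels are hypothetical names used only for indexing.

```r
# Coefficients as reported in this document (three decimal places).
coefs <- data.frame(
  model     = c("Cond", "Qual", "Garage", "Living"),
  intercept = c(211909.592, -96206.080, 71357.421, 18569.026),
  slope     = c(-5558.115, 45435.803, 231.646, 107.130)
)

# Predictor values of the first test row shown above.
first_row <- c(Cond = 6, Qual = 5, Garage = 730, Living = 896)

# intercept + slope * predictor, matched by model label.
coefs$predicted <- coefs$intercept + coefs$slope * unname(first_row[coefs$model])
print(coefs)
```

The results agree with the first row of the output to within about a dollar; the remaining gap comes from rounding of the coefficients.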

Visualising the Predicted Sale Prices of Properties in Test Dataset

The “Predicting Sale Prices of properties using Developed Models in Testdata” graph compares the predictions produced by the four models: Condition, Quality, Garage Area, and Living Area. The R-squared values reported with it are obtained by regressing each model’s predicted price on OverallQual, so they measure how closely each set of predictions tracks overall quality. They are not measures of predictive accuracy, because the test data contains no observed sale prices. The Quality model has an R-squared of 1.00 by construction: its predictions, shown as blue triangles, are an exact linear function of OverallQual. This perfect score is therefore neither evidence of exceptional accuracy nor a sign of overfitting.

The Garage Area and Living Area models have R-squared values of 0.3228 and 0.3120, respectively, meaning that about a third of the variation in their predicted prices is linearly related to overall quality. These models, represented by green squares and purple crosses, show wider distributions of predicted prices.

The Condition model has an R-squared of just 0.0092. Its narrow distribution of red circles shows that its predictions barely vary from one property to another, which reflects the small coefficient on overall condition.

Given these observations, the Quality model remains the preferred predictor of sale prices, but that conclusion rests on its having the lowest RMSE on the training data rather than on the R-squared values shown here. How well any of the four models generalizes can only be judged on data with observed sale prices.

plot_data <- ameshous_test_data %>%
  select(OverallCond, GarageArea, GrLivArea, OverallQual, Pred_SalePrice_Cond, Pred_SalePrice_Qual, Pred_SalePrice_Garage, Pred_SalePrice_Living) %>%
  pivot_longer(cols = starts_with("Pred"), names_to = "Model", values_to = "PredictedPrice")

plot_data$Model <- factor(plot_data$Model, levels = c("Pred_SalePrice_Cond", "Pred_SalePrice_Qual", "Pred_SalePrice_Garage", "Pred_SalePrice_Living"),
                          labels = c("Condition", "Quality", "Garage Area", "Living Area"))

ggplot(plot_data, aes(x = Model, y = PredictedPrice, color = Model, shape = Model)) +
  geom_point(alpha = 0.6, size = 3) +  
  scale_color_brewer(palette = "Set1") +  
  geom_smooth(method = "lm", se = FALSE, aes(group = Model), linetype = "dashed") +
  labs(title = "Predicting Sale Prices of properties using Developed Models in Testdata",
       x = "Models",
       y = "Predicted Sale Price",
       color = "Model",
       shape = "Model") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(override.aes = list(size = 5)),
         shape = guide_legend(override.aes = list(size = 5)))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Duplicated `override.aes` is ignored.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

rsquared <- plot_data %>%
  group_by(Model) %>%
  summarise(Rsquared = summary(lm(PredictedPrice ~ OverallQual))$r.squared, .groups = "keep")

cat("R-squared values for each model:\n")
## R-squared values for each model:
print(rsquared)
## # A tibble: 4 × 2
## # Groups:   Model [4]
##   Model       Rsquared
##   <fct>          <dbl>
## 1 Condition    0.00919
## 2 Quality      1      
## 3 Garage Area  0.323  
## 4 Living Area  0.312
cat("\n")
best_model <- rsquared$Model[which.max(rsquared$Rsquared)]
# as.character() prints the factor label instead of its integer code
cat("Best model (Highest R-squared):", as.character(best_model), "\n")
## Best model (Highest R-squared): Quality
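
The R-squared of exactly 1 for the Quality model follows from how the value is computed: predictions from a single-predictor linear model are an exact linear function of that predictor. The sketch below shows the effect on synthetic data; the variables and numbers are hypothetical and unrelated to the Ames files.

```r
set.seed(1)  # arbitrary seed, used only to make the toy data reproducible

# Synthetic data (hypothetical): a noisy linear relation between x and y.
x <- runif(200, 1, 10)
y <- 50 + 20 * x + rnorm(200, sd = 40)

fit  <- lm(y ~ x)
pred <- predict(fit)

# R-squared of the actual fit: the share of variation in y explained by x.
r2_fit <- summary(fit)$r.squared

# Squared correlation between the predictions and the predictor: pred is an
# exact linear function of x, so this is 1 however noisy the data are.
r2_pred <- cor(pred, x)^2

c(fit = r2_fit, predictions = r2_pred)
```

The first value reflects the noise in the data, while the second is 1 up to floating-point error, so only the first says anything about model quality.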

Evaluating the model

Model Assessment

Comparing the regression models across property attributes reveals distinct patterns in their ability to explain sale prices in the Ames housing dataset. The “Sale Price ~ Overall Quality” model is the strongest performer, with an R-squared value of 0.626, indicating that approximately 62.6% of the variability in sale prices can be attributed to the overall quality of houses. The “Sale Price ~ Living Area” model also has substantial explanatory ability, with an R-squared value of 0.502, reinforcing the notion that larger living spaces typically command higher sale prices. The “Sale Price ~ Garage Area” model offers moderate explanatory power, with an R-squared of 0.389, indicating that garage area influences sale prices but less than overall quality and living area do. The “Sale Price ~ Overall Condition” model has very limited explanatory power, with an R-squared value of just 0.006, so overall condition alone is a poor predictor of sale prices in this dataset.

These R-squared values describe fits against observed sale prices and should not be confused with the summary table below. That table refits each model to its own predicted prices in the test data, which are exact linear functions of the predictor. It therefore reproduces the original coefficients with an R2 of 1.000 and an RMSE of 0.00 for every model, and its AIC and BIC values cannot be used to rank the models.

These insights underscore the importance of prioritizing quality and living space attributes when building predictive models for real estate valuation, enabling more accurate pricing assessments and better-informed decisions for buyers and sellers alike.

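# Refit each model to its own predicted sale prices (e.g., Pred_SalePrice_Cond, etc.).
# Each prediction is an exact linear function of its predictor, so these fits are
# perfect (R2 = 1, RMSE = 0) and reproduce the original coefficients.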
m1_cond_pred <- lm(Pred_SalePrice_Cond ~ OverallCond, data = ameshous_test_data)
m2_qual_pred <- lm(Pred_SalePrice_Qual ~ OverallQual, data = ameshous_test_data)
m3_garage_pred <- lm(Pred_SalePrice_Garage ~ GarageArea, data = ameshous_test_data)
m4_living_pred <- lm(Pred_SalePrice_Living ~ GrLivArea, data = ameshous_test_data)
# Store the models in a list
models_pred <- list(
  "Sale Price ~ Overall Condition" = m1_cond_pred,
  "Sale Price ~ Overall Quality" = m2_qual_pred,
  "Sale Price ~ Garage Area" = m3_garage_pred,
  "Sale Price ~ Living Area" = m4_living_pred
)

# Summarize the models
modelsummary(models_pred)
                 Sale Price ~ Overall Condition   Sale Price ~ Overall Quality   Sale Price ~ Garage Area   Sale Price ~ Living Area
(Intercept)      211909.592                       -96206.080                     71357.421                  18569.026
                 (0.000)                          (0.000)                        (0.000)                    (0.000)
OverallCond      -5558.115
                 (0.000)
OverallQual                                       45435.803
                                                  (0.000)
GarageArea                                                                       231.646
                                                                                 (0.000)
GrLivArea                                                                                                   107.130
                                                                                                            (0.000)
Num.Obs.         1459                             1459                           1458                       1459
R2               1.000                            1.000                          1.000                      1.000
R2 Adj.          1.000                            1.000                          1.000                      1.000
AIC              -52680.2                         -53542.2                       -56034.1                   -59655.0
BIC              -52664.3                         -53526.3                       -56018.2                   -59639.1
Log.Lik.         26343.092                        26774.085                      28020.048                  29830.492
F                4.59e+27                         9.22e+29                       3.1e+30                    3.86e+31
RMSE             0.00                             0.00                           0.00                       0.00
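
The test data has no SalePrice column, so the table above cannot measure out-of-sample accuracy. One common alternative is to hold out part of the training data and compute RMSE on it. The sketch below shows the idea on synthetic data; the data frame, split fraction, and seed are assumptions made for illustration, not part of the original analysis.

```r
set.seed(42)  # arbitrary seed for the toy example

# Synthetic stand-in for a training set with a known outcome (hypothetical data).
n <- 300
toy <- data.frame(x = runif(n, 1, 10))
toy$y <- 100 + 30 * toy$x + rnorm(n, sd = 10)

# Hold out 20% of the rows for validation.
val_idx <- sample(n, size = 0.2 * n)
train   <- toy[-val_idx, ]
valid   <- toy[val_idx, ]

fit <- lm(y ~ x, data = train)

# RMSE on rows the model has not seen during fitting.
val_pred <- predict(fit, newdata = valid)
val_rmse <- sqrt(mean((valid$y - val_pred)^2))
val_rmse
```

Applied to the Ames training data, the same pattern would give each of the four models a validation RMSE that can be compared directly.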

Model Diagnostics- Performing Residual Diagnostics

Among the four models, the Quality model appears to be the most effective, with the highest R-squared against observed sale prices and the lowest RMSE; however, its heteroscedasticity needs to be addressed, and the perfect fits in the summary table above reflect how that table was built rather than genuine accuracy. Both the Garage Area and Living Area models provide moderate effectiveness with clear paths for improvement. The Condition model shows the least effectiveness and might require a more substantial reevaluation or a different modeling approach altogether.

In conclusion, while each model has its strengths, they all exhibit specific diagnostic challenges that must be addressed to improve their predictive accuracy and reliability. By applying appropriate statistical techniques to handle these issues, these models can be refined to provide more dependable insights into the real estate market dynamics in the Ames dataset.

Model 1: Sale Price vs OverallCond

The diagnostic plots provide a comprehensive assessment of the regression model predicting sale prices based on the overall condition of houses in the Ames dataset. The residuals versus fitted values plot highlights potential issues with linearity and homoscedasticity, as the residuals do not scatter randomly around zero, indicating possible non-linearity or unequal variance in the data. The histogram of residuals suggests non-normality in their distribution, which can affect the reliability of regression estimates. The ‘Posterior Predictive Check’ plot indicates a mismatch between observed and predicted data densities, while the ‘Homogeneity of Variance’ plot reveals heteroscedasticity, confirmed by a significant test (p < .001). Outliers and influential observations are evident in the ‘Normality of Residuals’ and ‘Influential Observations’ plots, respectively, potentially skewing model results. Addressing these issues through variable transformations, robust regression methods, or alternative modeling approaches is essential for enhancing the model’s accuracy and validity, ensuring more reliable insights into the impact of house condition on sale prices in the Ames dataset.

names(ameshous_test_data)
##  [1] "Id"                    "MSSubClass"            "MSZoning"             
##  [4] "LotFrontage"           "LotArea"               "Street"               
##  [7] "Alley"                 "LotShape"              "LandContour"          
## [10] "Utilities"             "LotConfig"             "LandSlope"            
## [13] "Neighborhood"          "Condition1"            "Condition2"           
## [16] "BldgType"              "HouseStyle"            "OverallQual"          
## [19] "OverallCond"           "YearBuilt"             "YearRemodAdd"         
## [22] "RoofStyle"             "RoofMatl"              "Exterior1st"          
## [25] "Exterior2nd"           "MasVnrType"            "MasVnrArea"           
## [28] "ExterQual"             "ExterCond"             "Foundation"           
## [31] "BsmtQual"              "BsmtCond"              "BsmtExposure"         
## [34] "BsmtFinType1"          "BsmtFinSF1"            "BsmtFinType2"         
## [37] "BsmtFinSF2"            "BsmtUnfSF"             "TotalBsmtSF"          
## [40] "Heating"               "HeatingQC"             "CentralAir"           
## [43] "Electrical"            "X1stFlrSF"             "X2ndFlrSF"            
## [46] "LowQualFinSF"          "GrLivArea"             "BsmtFullBath"         
## [49] "BsmtHalfBath"          "FullBath"              "HalfBath"             
## [52] "BedroomAbvGr"          "KitchenAbvGr"          "KitchenQual"          
## [55] "TotRmsAbvGrd"          "Functional"            "Fireplaces"           
## [58] "FireplaceQu"           "GarageType"            "GarageYrBlt"          
## [61] "GarageFinish"          "GarageCars"            "GarageArea"           
## [64] "GarageQual"            "GarageCond"            "PavedDrive"           
## [67] "WoodDeckSF"            "OpenPorchSF"           "EnclosedPorch"        
## [70] "X3SsnPorch"            "ScreenPorch"           "PoolArea"             
## [73] "PoolQC"                "Fence"                 "MiscFeature"          
## [76] "MiscVal"               "MoSold"                "YrSold"               
## [79] "SaleType"              "SaleCondition"         "Pred_SalePrice_Cond"  
## [82] "Pred_SalePrice_Qual"   "Pred_SalePrice_Garage" "Pred_SalePrice_Living"
library(see)          # plotting support for check_model()
library(performance)  # check_model(), check_heteroscedasticity()

# Fit the model using the predicted sale prices (Pred_SalePrice_Cond)
m1_cond_pred <- lm(Pred_SalePrice_Cond ~ OverallCond, data = ameshous_test_data)

# Augment the model with residuals and fitted values
m1_aug_cond_pred <- augment(m1_cond_pred)

# Residual vs. Fitted Plot
ggplot(data = m1_aug_cond_pred, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals") +
  theme_minimal()

# Histogram of Residuals
ggplot(data = m1_aug_cond_pred, aes(x = .resid)) +
  geom_histogram(color = 'red', fill = 'skyblue', bins = 30) +
  xlab("Residuals") +
  theme_minimal()

# Model diagnostics: check for heteroscedasticity
check_model(m1_cond_pred)

check_heteroscedasticity(m1_cond_pred)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
# Compute RMSE (Root Mean Squared Error) based on residuals
rmse_m1_cond_pred <- sqrt(mean(m1_aug_cond_pred$.resid^2))
print(paste("RMSE:", rmse_m1_cond_pred))
## [1] "RMSE: 3.57125872431811e-09"
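
An RMSE on the order of 1e-09 is what one would expect if Pred_SalePrice_Cond is itself a linear function of OverallCond. That is an assumption here, since this section does not show how the column was built. The sketch below uses simulated data with hypothetical names (cond, price, pred) to contrast refitting a line to its own predictions with comparing predictions against observed prices on a held-out set.

```r
set.seed(7)
n <- 300
cond  <- sample(1:9, n, replace = TRUE)                    # hypothetical condition rating
price <- 150000 + 5000 * cond + rnorm(n, sd = 40000)      # hypothetical observed prices
dat   <- data.frame(cond, price)

train <- dat[1:200, ]
test  <- dat[201:300, ]

# Fit on the training rows, predict on the held-out rows
fit <- lm(price ~ cond, data = train)
test$pred <- predict(fit, newdata = test)

# Refitting a line to the predictions reproduces them: residuals are ~0
refit <- lm(pred ~ cond, data = test)
rmse_refit <- sqrt(mean(residuals(refit)^2))

# Comparing predictions with observed prices gives the prediction error
rmse_holdout <- sqrt(mean((test$price - test$pred)^2))

cat("RMSE, refit on predictions:", rmse_refit, "\n")
cat("RMSE, against observed prices:", rmse_holdout, "\n")
```

The second figure is the one that describes predictive accuracy. It needs observed sale prices, which the test set listed above does not contain.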

Model 2: Sale Price vs Overall Quality

The diagnostic plots offer a comprehensive assessment of the regression model’s performance in predicting sale prices based on the overall quality of houses in the Ames dataset. While the model demonstrates reasonable linearity between residuals and fitted values, suggesting a linear relationship, concerns arise regarding heteroscedasticity, as evidenced by non-constant variance in the residuals across the range of fitted values. Additionally, the histogram of residuals indicates some deviation from normality, potentially impacting the reliability of statistical inferences derived from the model. Despite these concerns, the analysis reveals limited influence from outliers, suggesting that extreme data points do not unduly affect the model’s fit. Overall, while the model exhibits strengths in linearity and robustness to outliers, addressing issues such as heteroscedasticity and normality of residuals is essential to enhance the model’s reliability and ensure more accurate predictions of house prices based on overall quality.

# Fit the model using the predicted sale prices (Pred_SalePrice_Qual)
m2_qual <- lm(Pred_SalePrice_Qual ~ OverallQual, data = ameshous_test_data)

# Augment the model with residuals and fitted values
m2_aug_qual <- augment(m2_qual)

# Residual vs. Fitted Plot
ggplot(data = m2_aug_qual, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals") +
  theme_minimal()

# Histogram of Residuals
ggplot(data = m2_aug_qual, aes(x = .resid)) +
  geom_histogram(color = 'red', fill = 'skyblue', bins = 30) +
  xlab("Residuals") +
  theme_minimal()

# Model diagnostics: check for heteroscedasticity
check_model(m2_qual)

check_heteroscedasticity(m2_qual)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
# Compute RMSE (Root Mean Squared Error) based on residuals
rmse_m2_qual <- sqrt(mean(m2_aug_qual$.resid^2))
print(paste("RMSE:", rmse_m2_qual))
## [1] "RMSE: 2.69945907690479e-09"

Model 3: Sale Price vs GarageArea

The diagnostic plots from the regression model assessing the relationship between garage area and sale price in the Ames housing dataset reveal several critical issues impacting the model’s fitness and reliability. While the residuals vs. fitted values plot indicates reasonable linearity, the spread of residuals increases with higher fitted values, indicating potential heteroscedasticity. Moreover, the histogram of residuals displays some skewness, suggesting deviations from normality that could affect the validity of statistical inferences. Despite the absence of systematic non-linearity, heteroscedasticity presents a significant concern, as it violates the assumptions of homogeneity of variance. However, influential outliers do not seem to substantially influence the model. In summary, while the model captures a linear relationship without significant outliers, addressing heteroscedasticity and non-normality of residuals is essential to enhance its reliability for predicting sale prices based on garage area.

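# Fit the model using the predicted sale prices (Pred_SalePrice_Garage)
m3_garage <- lm(Pred_SalePrice_Garage ~ GarageArea, data = ameshous_test_data)

# Augment the model with residuals and fitted values
m3_aug_garage <- augment(m3_garage)

# Residual vs. Fitted Plot (plot the augmented data frame, not the lm object)
ggplot(data = m3_aug_garage, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals") +
  theme_minimal()

# Histogram of Residuals
ggplot(data = m3_aug_garage, aes(x = .resid)) +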
  geom_histogram(color = 'red', fill = 'skyblue', bins = 30) +
  xlab("Residuals") +
  theme_minimal()

check_model(m3_garage)

check_heteroscedasticity(m3_garage)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
rmse_m3_garage <- sqrt(mean(m3_aug_garage$.resid^2))
print(paste("RMSE:", rmse_m3_garage))
## [1] "RMSE: 1.09785798498115e-09"

Model 4: Sale Price vs GrLivArea (Living Area)

The diagnostic plots from the regression model evaluating the relationship between living area (GrLivArea) and sale price in the Ames housing dataset provide valuable insights into the model’s performance. While the residuals vs. fitted values plot displays a scatter of residuals around the zero line, indicating some level of linearity, noticeable patterns and outliers suggest potential issues with both linearity and homogeneity of variance. The histogram of residuals reveals a roughly symmetrical distribution with slight skewness, indicating minor deviations from normality that could affect the reliability of regression coefficients. Further diagnostic checks confirm the presence of heteroscedasticity, violating a fundamental assumption of OLS regression, and reveal deviations from normality, particularly at extreme values. Although influential observations are mostly within acceptable bounds, these findings collectively suggest that while the model captures a general linear trend, it may not provide the most reliable estimates without adjustments or the application of robust statistical techniques to address these issues. Therefore, further refinements are necessary before considering the model suitable for predictive purposes.

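# Fit the model using the predicted sale prices (Pred_SalePrice_Living)
m4_living <- lm(Pred_SalePrice_Living ~ GrLivArea, data = ameshous_test_data)

# Augment the model with residuals and fitted values
m4_aug_living <- augment(m4_living)

# Residual vs. Fitted Plot (plot the augmented data frame, not the lm object)
ggplot(data = m4_aug_living, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals") +
  theme_minimal()

# Histogram of Residuals
ggplot(data = m4_aug_living, aes(x = .resid)) +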
  geom_histogram(color = 'red', fill = 'skyblue', bins = 30) +
  xlab("Residuals") +
  theme_minimal()

check_model(m4_living)

check_heteroscedasticity(m4_living)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
rmse_m4_living <- sqrt(mean(m4_aug_living$.resid^2))
print(paste("RMSE:", rmse_m4_living))
## [1] "RMSE: 3.32972431761919e-10"
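
One adjustment of the kind mentioned above is to keep the OLS coefficients and replace the usual standard errors with heteroscedasticity-consistent ones. The sketch below computes the HC0 "sandwich" estimator in base R on simulated data. The variable names (living_area, price) and the data-generating process are assumptions for illustration, not the Ames data or the author's method.

```r
# Simulated stand-in data (hypothetical; not the Ames test set)
set.seed(42)
n <- 500
living_area <- runif(n, 500, 4000)
# Error spread grows with living area, mimicking heteroscedasticity
price <- 20000 + 100 * living_area + rnorm(n, sd = 10 * living_area)

fit <- lm(price ~ living_area)

# HC0 sandwich covariance: (X'X)^-1 X' diag(e^2) X (X'X)^-1
X <- model.matrix(fit)
e <- residuals(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # row i of X scaled by e_i, so this is X' diag(e^2) X
vcov_hc0 <- bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_robust    <- sqrt(diag(vcov_hc0))
print(rbind(classical = se_classical, robust = se_robust))
```

The coefficient estimates are unchanged. Only the standard errors, and therefore the confidence intervals and tests, are adjusted.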

Additional Diagnostics in Models

The studentized Breusch-Pagan tests for the four regression models (Condition, Quality, Garage Area, and Living Area) give a mixed picture in the output below. Only the Condition model rejects the constant-variance assumption at the 5% level (BP = 10.695, p = 0.0011). The Quality (BP = 0.568, p = 0.451), Garage Area (BP = 1.405, p = 0.236) and Living Area (BP = 1.581, p = 0.209) models do not. This contrasts with the check_heteroscedasticity() warnings reported for the individual models above, so the two diagnostics should be reconciled before any model is treated as free of heteroscedasticity. Note also that the BP statistic measures evidence against constant error variance. It does not measure how strongly a predictor influences sale price, so a larger statistic does not make a model more informative, and a smaller one does not make a predictor less effective. Where heteroscedasticity is confirmed, corrective measures such as robust standard errors, robust regression techniques or variable transformations are needed before the model can support reliable inference.

bptest(m1_cond)
## 
##  studentized Breusch-Pagan test
## 
## data:  m1_cond
## BP = 10.695, df = 1, p-value = 0.001074
bptest(m2_qual)
## 
##  studentized Breusch-Pagan test
## 
## data:  m2_qual
## BP = 0.56848, df = 1, p-value = 0.4509
bptest(m3_garage)
## 
##  studentized Breusch-Pagan test
## 
## data:  m3_garage
## BP = 1.4052, df = 1, p-value = 0.2359
bptest(m4_living)
## 
##  studentized Breusch-Pagan test
## 
## data:  m4_living
## BP = 1.581, df = 1, p-value = 0.2086
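
To make the test statistic concrete, the sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R: the squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. This is the form bptest() reports by default. The data are simulated and the names (x, y) are hypothetical.

```r
set.seed(1)
n <- 400
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)   # error spread grows with x

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized statistic: n * R^2 of the auxiliary regression
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP =", round(bp_stat, 3), " df =", bp_df, " p-value =", signif(bp_p, 4), "\n")
```

A small p-value says the squared residuals vary systematically with the predictor. It says nothing about how well the predictor explains the response.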

Normalisation of Target Variable “Sale Price” using log

The histogram provided displays the distribution of the logarithm of sale prices extracted from the Ames housing dataset. This transformation, often employed in regression analyses, aims to normalize positively skewed target variables. Upon analysis, the histogram reveals an approximately symmetrical distribution around the central values, indicating the effectiveness of the logarithmic transformation in normalizing the data. This normalization is advantageous as it aligns with assumptions of many statistical tests and models, particularly those assuming normally distributed errors. By stabilizing variance and reducing the influence of outliers, the transformed data enhances the performance and validity of statistical models, making predictions more reliable. Moreover, using logarithmic sale prices facilitates the capture of relative changes and elasticities in housing prices, enabling more interpretable insights, particularly in economic terms. Overall, the logarithmic transformation proves appropriate for addressing right-skewed sale price distributions in real estate data, ultimately leading to more robust models and better statistical inference and predictions.

ames_housing <- ames_housing %>% 
  mutate(sale_ames = log(SalePrice))

ggplot(ames_housing) +
  geom_histogram(aes(sale_ames), color = "black", fill="orange")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
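
The effect of the transformation can also be checked numerically rather than by eye. The sketch below compares sample skewness before and after taking logs. It uses simulated log-normal "prices" and a small helper, skewness(), defined here for illustration; neither comes from the original analysis.

```r
set.seed(123)
# Hypothetical right-skewed prices drawn from a log-normal distribution
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness (helper defined here, not from a package)
skewness <- function(v) {
  z <- v - mean(v)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(price)
skew_log <- skewness(log(price))
cat("Skewness, raw:", round(skew_raw, 3), " log:", round(skew_log, 3), "\n")
```

A skewness near zero after the transformation matches the roughly symmetric histogram described above. The same helper could be applied to SalePrice and sale_ames.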

Overall quality shows the strongest correlation with log sale price, and the analysis above also suggests that it gives the best-fitting model of the four.

ames_housing %>%
  summarise(cor(OverallQual, sale_ames))
##   cor(OverallQual, sale_ames)
## 1                   0.8171844

The regression of the log-transformed sale price (sale_ames) on overall quality (OverallQual) gives an intercept of about 10.6 and an OverallQual coefficient of about 0.236, both highly statistically significant (the p-values print as 0 because they fall below the displayed precision). Because the response is on the log scale, the coefficient is a change in log price: each one-unit rise in OverallQual multiplies the predicted sale price by exp(0.236) ≈ 1.27, an increase of roughly 27%. The intercept is the predicted log price at OverallQual = 0, which lies outside the 1 to 10 rating scale, so it acts only as a baseline. The model explains about 66.8% of the variability in log sale price (R-squared: 0.668), with a residual standard deviation (sigma) of about 0.230 on the log scale and a large F-statistic (2931), indicating that the model is significant overall. On the original price scale the fitted relationship is therefore exponential in quality, which makes the model a useful starting point for prediction.

m5 <- lm(sale_ames ~ OverallQual, data = ames_housing)
tidy(m5)
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   10.6     0.0273      388.        0
## 2 OverallQual    0.236   0.00436      54.1       0
glance(m5)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.668         0.668 0.230     2931.       0     1   73.1 -140. -124.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
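
The arithmetic behind reading a log-scale coefficient can be sketched with the figures printed above. The variable names are introduced here for illustration only.

```r
# Values copied from the output above
beta_qual <- 0.236        # OverallQual estimate from tidy(m5)
r_qual    <- 0.8171844    # cor(OverallQual, sale_ames)

# A one-unit rise in OverallQual multiplies the predicted price by exp(beta)
pct_change <- (exp(beta_qual) - 1) * 100

# With a single predictor, R-squared equals the squared correlation
r_squared <- r_qual^2

cat("Approx. % change in price per quality unit:", round(pct_change, 1), "\n")
cat("Squared correlation:", round(r_squared, 3), "\n")
```

The squared correlation agrees with the r.squared reported by glance(m5), which is a quick consistency check between the two outputs.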

The plot visualizes the residuals versus the fitted values for the regression model (m5) predicting the logarithmic transformation of sale prices based on overall house quality in the Ames housing dataset. The absence of a clear pattern in the residuals suggests reasonable linearity in the model. However, the presence of outliers, especially for higher fitted values, raises concerns about potential model sensitivity to extreme values. Additionally, the slight increase in residual spread with higher fitted values indicates possible heteroscedasticity, challenging the assumption of equal variance. While the model generally fits well, addressing these issues through further diagnostic checks or model adjustments could enhance its predictive accuracy and reliability.

m5_aug <- augment(m5)

ggplot(data = m5_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")+
  theme_minimal()

The histogram illustrates the distribution of residuals from the regression model (m5). The bell-shaped curve indicates that the residuals are approximately normally distributed, in line with the normality assumption of linear regression. The peak is centered around zero, as expected for least-squares residuals, which average zero by construction. However, a few outliers and slight skewness towards the negative side hint at potential issues that could affect prediction reliability, especially for extreme values. While the overall distribution supports the model’s validity, investigating the outliers and the skewness further could improve predictive performance and model robustness.

ggplot(data = m5_aug, aes(x = .resid)) +
  geom_histogram(color = 'red', fill = 'skyblue', binwidth = 0.05) + 
  xlab("Residuals") +
  theme_minimal()

The diagnostic plots provide a comprehensive evaluation of the regression model’s assumptions, crucial for validating its suitability for predictive analysis. The posterior predictive check confirms that the model predictions align well with the observed data distribution. Linearity diagnostics show that residuals are evenly scattered around zero, supporting the assumption of a linear relationship between predictors and the response variable. Homoscedasticity diagnostics indicate consistent variance across fitted values, further validating model assumptions. While influential observations mostly fall within acceptable bounds, minor deviations suggest some points may warrant further scrutiny. The normality plot indicates minor deviations from normality, particularly in the upper tail. Overall, the model appears well-specified, with minor concerns that could be addressed with transformations or robust regression techniques for enhanced precision in predictions, especially at the extremes.

check_model(m5)

The message “Error variance appears to be homoscedastic (p = 0.705)” indicates that the test found no evidence that the variability of the residuals changes across levels of the predictor. In simpler terms, the spread of errors around the regression line does not appear to change systematically as the fitted values increase or decrease. The p-value of 0.705 is well above the conventional 0.05 significance level, so the null hypothesis of constant variance is not rejected; this is an absence of evidence against homoscedasticity rather than proof of it. The result matters because constant error variance is what makes the usual least-squares standard errors, confidence intervals and hypothesis tests valid. In conclusion, the log-transformed model is consistent with the homoscedasticity assumption, which supports the credibility of the statistical conclusions drawn from it.

check_heteroscedasticity(m5)
## OK: Error variance appears to be homoscedastic (p = 0.705).